CUDA Errors with YOLOv5 Object Detection on CPAI 2.5.x – Seeking Insights

Sardon · Post by **Sardon** » Sun Feb 18, 2024 10:32 am

I'm reaching out to share a perplexing issue I've encountered with the integration of CPAI with my BI setup, hoping to find out whether anyone else has experienced something similar or can offer any insights. The problem first manifested around 01:45 am on 14/02/2024, and despite troubleshooting efforts, it recurred this morning, indicating a persistent underlying issue.

Initially, the system logs from 14th February showed an error related to CUDA, specifically mentioning "an illegal memory access was encountered". This issue caused a loop of errors until a system reboot was performed at 9:06 am.

Here is the exact log entry for reference:

2024-02-14 01:39:57: Object Detection (YOLOv5 6.2): Retrieved objectdetection_queue command 'custom' in Object Detection (YOLOv5 6.2)
2024-02-14 01:39:57: Object Detection (YOLOv5 6.2): Detecting using ipcam-combined in Object Detection (YOLOv5 6.2)
2024-02-14 01:39:57: Response received (#reqid 85bde494-89d3-429d-a21b-c10b9430c5a8 for command custom)
2024-02-14 01:39:57: Object Detection (YOLOv5 6.2):  [RuntimeError] : Traceback (most recent call last):
  File "C:\Program Files\CodeProject\AI\modules\ObjectDetectionYOLOv5-6.2\detect.py", line 141, in do_detection
    det                  = detector(img, size=640)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\models\common.py", line 669, in forward
    with dt[0]:
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\utils\general.py", line 158, in __enter__
    self.start = self.time()
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\utils\general.py", line 167, in time
    torch.cuda.synchronize()
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\cuda\__init__.py", line 566, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
 in Object Detection (YOLOv5 6.2)

I upgraded yesterday from CPAI v2.5.1 to v2.5.4, hoping the update would resolve the issue which took place on the 14th. However, this morning, the same CUDA error reappeared, this time indicating "an illegal instruction was encountered". The error persisted until a reboot was done just after 9 am. Below is the log excerpt from today's occurrence:

2024-02-18 04:52:40: Object Detection (YOLOv5 6.2): Detecting using ipcam-combined in Object Detection (YOLOv5 6.2)
2024-02-18 04:52:40: Response rec'd from Object Detection (YOLOv5 6.2) command 'custom' (#reqid ad29496b-5caf-4d8b-b02f-7fcc6c7ab605) ['No objects found']  took 22ms
2024-02-18 04:52:40: Client request 'custom' in queue 'objectdetection_queue' (#reqid 7e1b52dd-b880-4c35-b7c6-0f076127faab)
2024-02-18 04:52:40: Request 'custom' dequeued from 'objectdetection_queue' (#reqid 7e1b52dd-b880-4c35-b7c6-0f076127faab)
2024-02-18 04:52:40: Object Detection (YOLOv5 6.2): Retrieved objectdetection_queue command 'custom' in Object Detection (YOLOv5 6.2)
2024-02-18 04:52:40: Object Detection (YOLOv5 6.2): Detecting using ipcam-combined in Object Detection (YOLOv5 6.2)
2024-02-18 04:52:40: Response rec'd from Object Detection (YOLOv5 6.2) command 'custom' (#reqid 7e1b52dd-b880-4c35-b7c6-0f076127faab)
2024-02-18 04:52:40: Object Detection (YOLOv5 6.2):  [RuntimeError] : Traceback (most recent call last):
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\models\common.py", line 715, in forward
    max_det=self.max_det)  # NMS
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\utils\general.py", line 920, in non_max_suppression
    x = torch.cat((box, conf, j.float(), mask), 1)[conf.view(-1) > conf_thres]
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\CodeProject\AI\modules\ObjectDetectionYOLOv5-6.2\detect.py", line 141, in do_detection
    det                  = detector(img, size=640)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\models\common.py", line 717, in forward
    scale_boxes(shape1, y[i][:, :4], shape0[i])
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\utils\general.py", line 162, in __exit__
    self.dt = self.time() - self.start  # delta-time
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\yolov5\utils\general.py", line 167, in time
    torch.cuda.synchronize()
  File "C:\Program Files\CodeProject\AI\runtimes\bin\windows\python37\venv\lib\site-packages\torch\cuda\__init__.py", line 566, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
 in Object Detection (YOLOv5 6.2)
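
The hint in the traceback, "For debugging consider passing CUDA_LAUNCH_BLOCKING=1", refers to an environment variable that has to be set before CUDA is initialised in the process. The following is a minimal sketch, assuming a standalone test script run in the same Python environment; `enable_blocking_launches` is a hypothetical helper name, and how CPAI itself passes environment variables to the module is not covered in this thread.

```python
import os
import sys


def enable_blocking_launches(env=os.environ):
    """Ask CUDA to launch kernels synchronously so errors surface at the failing call.

    Returns True if torch had not been imported yet, which is a conservative
    check that the setting was made early enough to take effect.
    """
    in_time = "torch" not in sys.modules
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return in_time


if __name__ == "__main__":
    if not enable_blocking_launches():
        print("Warning: torch was already imported; restart the process to be safe")
    try:
        import torch  # imported only after the variable has been set
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("torch is not installed in this environment")
```

Synchronous launches slow inference down, so this is a debugging aid rather than something to leave enabled permanently.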

The repeating nature of this error is particularly concerning as it undermines the reliability of object detection capabilities, which are crucial for the functionality I rely on. It's disconcerting to see the system fail in such a manner, especially considering the otherwise commendable performance improvements in the AI aspects of the software.

Has anyone else encountered similar issues, particularly with CUDA errors causing system instability? Any advice on troubleshooting or resolving this would be immensely appreciated. I'm also posting this query on the CodeProject site to cast a wider net for potential solutions.

Thank you in advance for your time and assistance.

TimG · Post by **TimG** » Sun Feb 18, 2024 9:19 pm

I was going to suggest posting it over there as it does appear to be a CPAI issue. I've had no issues at all here with my GTX1650/ CUDA/ YOLO5, and we probably have the same drivers.

Sardon · Post by **Sardon** » Tue Feb 20, 2024 5:14 pm

Hi Tim,

I'm currently grappling with persistent CUDA errors and am exploring every potential remedy. My system has a dual-GPU setup, using the i9-10850K's integrated GPU for display and the GTX 1650 exclusively for AI computational tasks. Yet the issue remains unresolved, prompting a re-evaluation of this strategy.

Additionally, I've initiated a discussion on the CodeProject forum, where Chris M has requested further details. This effort aims to widen the search for a solution. I'm eager to learn about your setup: do you employ an iGPU alongside your GTX 1650, or is the GTX 1650 your sole GPU for both display and processing?

Your insights could shed light on the root cause of my CUDA errors and the impact of my dual-GPU configuration.

I have to say that the rationale behind my dual-setup choice was to maximise resource efficiency, offloading display functions to the iGPU to free up the GTX 1650 for AI processing. I'm beginning to wonder if this ambitious setup might be the culprit behind the errors.

Please let me know. P.S. How much memory does your 1650 have? Chris M thinks it may be a limitation due to the card only having 4GB of RAM. Surely that can't be the case.
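
One way to test the 4GB hypothesis is to watch how much of the card's memory is actually in use over time. The following is a minimal sketch, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH; `parse_memory_csv` and `query_gpu_memory` are hypothetical helper names, not part of CPAI or BI.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one line per GPU, into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...], one entry per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


if __name__ == "__main__":
    try:
        for index, (used, total) in enumerate(query_gpu_memory()):
            print(f"GPU {index}: {used} / {total} MiB ({100 * used / total:.0f}% used)")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("Could not query nvidia-smi:", exc)
```

Running this on a schedule and logging the result would show whether usage creeps towards the 4GB limit before a crash or stays flat.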

Regards
SN

TimG · Post by **TimG** » Tue Feb 20, 2024 7:58 pm

Sardon wrote (Tue Feb 20, 2024 5:14 pm):

I'm eager to learn about your setup: do you employ an iGPU alongside your GTX 1650, or is the GTX 1650 your sole GPU for both display and processing?

Your insights could shed light on the root cause of my CUDA errors and the impact of my dual-GPU configuration.

I have to say that the rationale behind my dual-setup choice was to maximise resource efficiency, offloading display functions to the iGPU to free up the GTX 1650 for AI processing. I'm beginning to wonder if this ambitious setup might be the culprit behind the errors.

Please let me know. P.S. How much memory does your 1650 have? Chris M thinks it may be a limitation due to the card only having 4GB of RAM. Surely that can't be the case.

I run BI5 on a headless pc, so there is no monitor connected, just a dummy HDMI plug in the GTX1650 so that I can log in to it with "Splashtop" which is like Teamviewer. The cpu is an i7-8700k and the pc has 16GB of RAM. The iGPU was used for BI5 acceleration, but I stopped using that when sub-streams came along. My GTX1650 is a 4GB RAM version. Task Manager screen looks like this:

[Attachment: Screenshot 2024-02-20 195434.png]

TimG · Post by **TimG** » Tue Feb 20, 2024 10:18 pm

I am a version or two behind with CPAI v2.5x but am up to date with BI5.

Regular backups are the only reason I don't fall down too many rabbit holes, as I like to try things to help other forum members.

Sardon · Post by **Sardon** » Wed Feb 21, 2024 4:34 pm

Thank you for sharing that info Tim.

I'm still facing errors with my setup, which are frankly beginning to frustrate me. Out of curiosity, may I ask what size model you are utilising?

Chris M recommended opting for a smaller model size to strike a balance. Here is his quote:

"I have a NVIDIA 1030 and it's pretty basic but still does OK. The trick is definitely to balance accuracy and model size, so I'd definitely recommend trying a smaller model size (the dashboard allows you to do this via the gear icons on each module). Also try to minimise the number of modules you have that are using the GPU."

I'm sceptical that the model size is the root of the issue. Even after changing my configuration so that the iGPU is no longer used solely for display (the machine is also headless, as I use RDP), I encountered errors again this morning, indicating that something else is amiss.

The plan now is to follow Chris M's advice and test with a smaller model size to see if the error persists, which is why I asked about your preferred model size. I've noticed it takes around 10-12 hours for this bug to occur in CPAI. So if there are no errors by tomorrow now that I've set the small model, it may be the model size after all. God knows! LOL

Anyway, if I continue to face issues with a smaller model, it will necessitate a thorough reassessment of my approach. It's particularly disappointing as I've made significant progress with CPAI, which has been performing admirably until now, only to be hindered by this vexing software issue. Rather typical, in all fairness.

At this juncture, I'm contemplating a complete uninstallation and fresh installation as a potential solution. I'll observe how this unfolds tonight. Additionally, I've noticed that CPAI's inference times are exceedingly high upon initial start-up but seem to stabilise and improve to lower millisecond durations with continued use. This pattern, starting in the thousands of milliseconds and gradually reducing, might be normal, but I mention it because I see it consistently.
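
Slow first inferences followed by much faster ones are commonly seen with GPU frameworks, since the first calls pay one-off initialisation costs. The following is a minimal sketch of how to separate warm-up timings from steady-state timings; `time_calls` is a hypothetical helper, and the workload shown is a stand-in rather than a real CPAI inference call.

```python
import time


def time_calls(fn, warmup=3, runs=10):
    """Call fn warmup + runs times; return (warmup_ms, steady_ms) lists of durations."""
    durations = []
    for _ in range(warmup + runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations[:warmup], durations[warmup:]


if __name__ == "__main__":
    # Stand-in workload; replace with a real inference call to compare timings.
    warm, steady = time_calls(lambda: sum(range(100_000)))
    print("warm-up ms:", [round(d, 2) for d in warm])
    print("steady ms: ", [round(d, 2) for d in steady])
```

When timing real GPU work, the measured call needs to wait for the GPU to finish (for example by synchronising), otherwise the numbers only reflect how long it took to queue the work.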

Could you elucidate the differences between these model sizes and their respective impacts? On a positive note, CPAI has just successfully detected someone at my front door, so it's still operational, to some extent, using the small model.

It would be good to know what the different model sizes contain or how they are built. I looked on CPAI and didn't find anything.
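
One rough way to relate model size to the 4GB question is to estimate how much memory the weights alone occupy. The install log later in this thread reports "391 layers, 21805053 parameters" for the medium model, with half precision enabled. The following is a minimal sketch of that arithmetic; `weights_size_mib` is a hypothetical helper, and the estimate deliberately ignores activations, the CUDA context and any other modules sharing the GPU.

```python
def weights_size_mib(num_parameters, bytes_per_parameter):
    """Approximate memory taken by the model weights alone, in MiB."""
    return num_parameters * bytes_per_parameter / (1024 ** 2)


MEDIUM_PARAMS = 21_805_053  # from "391 layers, 21805053 parameters" in the install log

fp16 = weights_size_mib(MEDIUM_PARAMS, 2)  # half precision, 2 bytes per parameter
fp32 = weights_size_mib(MEDIUM_PARAMS, 4)  # single precision, 4 bytes per parameter
print(f"FP16 weights: {fp16:.1f} MiB, FP32 weights: {fp32:.1f} MiB")
```

That works out to roughly 42 MiB in half precision and 83 MiB in single precision, so the weights of the medium model are small next to 4GB. It says nothing about total usage, which has to be measured on the running system.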

Now, I shall turn my attention to the other discussion regarding daylight saving time. Hopefully I can get that sorted too!

Thanks again man for all the help
SN

TimG · Post by **TimG** » Wed Feb 21, 2024 5:56 pm

The Info button on YOLOv5 6.2 shows this, so Model size=Medium:

Module 'Object Detection (YOLOv5 6.2)' 1.9.0 (ID: ObjectDetectionYOLOv5-6.2)
Valid: True
Module Path: <root>\modules\ObjectDetectionYOLOv5-6.2
AutoStart: True
Queue: objectdetection_queue
Runtime: python3.7
Runtime Loc: Shared
FilePath: detect_adapter.py
Pre installed: False
Start pause: 1 sec
LogVerbosity:
Platforms: all,!raspberrypi,!jetson
GPU Libraries: installed if available
GPU Enabled: enabled
Parallelism: 0
Accelerator:
Half Precis.: enable
Environment Variables
APPDIR = <root>\modules\ObjectDetectionYOLOv5-6.2
CUSTOM_MODELS_DIR = <root>\modules\ObjectDetectionYOLOv5-6.2\custom-models
MODELS_DIR = <root>\modules\ObjectDetectionYOLOv5-6.2\assets
MODEL_SIZE = Medium
USE_CUDA = True
YOLOv5_AUTOINSTALL = false
YOLOv5_VERBOSE = false
Status Data: {
"successfulInferences": 1237095,
"failedInferences": 554,
"numInferences": 1237649,
"numItemsFound": 1403967,
"averageInferenceMs": 50.89386587125483,
"histogram": {
"car": 1291233,
"person": 51705,
"truck": 36987,
"bird": 11131,
"DayPlate": 7780,
"bus": 3621,
"bicycle": 308,
"horse": 11,
"dog": 973,
"cow": 28,
"deer": 49,
"cat": 41,
"motorcycle": 73,
"NightPlate": 5,
"pig": 2,
"bear": 3,
"sheep": 17
}
}
Started: 31 Jan 2024 8:35:56 AM GMT Standard Time
LastSeen: 21 Feb 2024 5:52:45 PM GMT Standard Time
Status: Started
Requests: 1237660 (includes status calls)
Provider: CUDA
CanUseGPU: True
HardwareType: GPU

Installation Log
2024-01-28 13:18:01: Installing CodeProject.AI Analysis Module
2024-01-28 13:18:01: ======================================================================
2024-01-28 13:18:01: CodeProject.AI Installer
2024-01-28 13:18:01: ======================================================================
2024-01-28 13:18:01: 143.8Gb of 237Gb available on
2024-01-28 13:18:01: General CodeProject.AI setup
2024-01-28 13:18:01: Creating Directories...Done
2024-01-28 13:18:01: GPU support
2024-01-28 13:18:02: CUDA Present...Yes (CUDA 11.8, cuDNN 8.5)
2024-01-28 13:18:02: ROCm Present...No
2024-01-28 13:18:04: Reading ObjectDetectionYOLOv5-6.2 settings.......Done
2024-01-28 13:18:04: Installing module Object Detection (YOLOv5 6.2) 1.9.0
2024-01-28 13:18:04: Installing Python 3.7
2024-01-28 13:18:04: Python 3.7 is already installed
2024-01-28 13:18:04: Creating Virtual Environment (Shared)...Virtual Environment already present
2024-01-28 13:18:04: Confirming we have Python 3.7 in our virtual environment...present
2024-01-28 13:18:20: Downloading Standard YOLO models...Expanding...Done.
2024-01-28 13:18:20: Copying contents of models-yolo5-pt.zip to assets...done
2024-01-28 13:18:37: Downloading Custom YOLO models...Expanding...Done.
2024-01-28 13:18:37: Copying contents of custom-models-yolo5-pt.zip to custom-models...done
2024-01-28 13:18:37: Installing Python packages for Object Detection (YOLOv5 6.2)
2024-01-28 13:18:37: Installing GPU-enabled libraries: If available
2024-01-28 13:18:38: Ensuring Python package manager (pip) is installed...Done
2024-01-28 13:18:40: Ensuring Python package manager (pip) is up to date...Done
2024-01-28 13:18:40: Python packages specified by requirements.windows.cuda.txt
2024-01-28 13:18:41: - Installing Pandas, a data analysis / data manipulation tool...Already installed
2024-01-28 13:18:43: - Installing CoreMLTools, for working with .mlmodel format models...Already installed
2024-01-28 13:18:44: - Installing OpenCV, the Open source Computer Vision library...Already installed
2024-01-28 13:18:45: - Installing Pillow, a Python Image Library...Already installed
2024-01-28 13:18:46: - Installing SciPy, a library for mathematics, science, and engineering...Already installed
2024-01-28 13:18:47: - Installing PyYAML, a library for reading configuration files...Already installed
2024-01-28 13:18:48: - Installing PyTorch, an open source machine learning framework...Already installed
2024-01-28 13:18:49: - Installing TorchVision, for working with computer vision models...Already installed
2024-01-28 13:19:39: - Installing Ultralytics YoloV5 package for object detection in images...(checked) Done
2024-01-28 13:19:40: - Installing Seaborn, a data visualization library based on matplotlib...Already installed
2024-01-28 13:19:40: Installing Python packages for the CodeProject.AI Server SDK
2024-01-28 13:19:42: Ensuring Python package manager (pip) is installed...Done
2024-01-28 13:19:44: Ensuring Python package manager (pip) is up to date...Done
2024-01-28 13:19:44: Python packages specified by requirements.txt
2024-01-28 13:19:45: - Installing Pillow, a Python Image Library...Already installed
2024-01-28 13:19:47: - Installing Charset normalizer...Already installed
2024-01-28 13:19:48: - Installing aiohttp, the Async IO HTTP library...Already installed
2024-01-28 13:19:49: - Installing aiofiles, the Async IO Files library...Already installed
2024-01-28 13:19:50: - Installing py-cpuinfo to allow us to query CPU info...Already installed
2024-01-28 13:19:52: - Installing Requests, the HTTP library...Already installed
2024-01-28 13:19:57: Fusing layers...
2024-01-28 13:19:57: YOLOv5.1m summary: 391 layers, 21805053 parameters, 0 gradients
2024-01-28 13:19:57: Adding AutoShape...
2024-01-28 13:19:58: Self test: Self-test passed
2024-01-28 13:19:58: Module setup time 00:01:56.48
2024-01-28 13:19:58: Setup complete
2024-01-28 13:19:58: Total setup time 00:01:57.35
Installer exited with code 0
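
The Status Data in the module info above makes it possible to put a number on how often inference fails on this system. The following is a minimal sketch using only the three counters from that output; the variable names are chosen here for illustration.

```python
status = {
    "successfulInferences": 1237095,
    "failedInferences": 554,
    "numInferences": 1237649,
}

# Sanity check: successes plus failures should account for every inference.
assert status["successfulInferences"] + status["failedInferences"] == status["numInferences"]

failure_rate = status["failedInferences"] / status["numInferences"]
print(f"Failure rate: {failure_rate:.4%}")
```

That is about 0.045% of inferences failing over three weeks of uptime. The output does not say what caused those failures, so they are not necessarily CUDA errors.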

TimG · Post by **TimG** » Wed Feb 21, 2024 6:06 pm

I have CPAI working on 5 2MP cameras and a 5MP camera. Clearly it's on the sub-streams ( BI5 AI settings/ "Use main stream if available" is un-ticked). Three of the 2MP cameras and the 5MP can trigger at similar times as they are all on the front of the house (Left High ptz, Reolink doorbell, Drive camera, Right High ptz) and I have had no visible problems with CPAI or CUDA.

Sardon · Post by **Sardon** » Thu Feb 22, 2024 9:05 am

Your system has been running smoothly for about three weeks.

Started: 31 Jan 2024 8:35:56 AM GMT Standard Time
LastSeen: 21 Feb 2024 5:52:45 PM GMT Standard Time

My system is now showing a pattern in when it crashes: after approximately 10-12 hours of continual use. I knew that neither the model size nor the memory size of the GPU was the issue, contrary to what Chris M suggested, and this troubles me.
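
The 10-12 hour pattern can be checked against the CPAI log rather than estimated by eye. The following is a minimal sketch, assuming log lines start with a timestamp in the format shown earlier in this thread; `error_times` and `hours_since` are hypothetical helpers, and the service start time in the example is a placeholder, not a value from my logs.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def error_times(lines, marker="[RuntimeError]"):
    """Return the timestamps of log lines that contain the marker."""
    times = []
    for line in lines:
        if marker in line:
            # Log lines start with a 19-character timestamp, e.g. "2024-02-14 01:39:57".
            times.append(datetime.strptime(line[:19], TIMESTAMP_FORMAT))
    return times


def hours_since(start, moment):
    """Elapsed hours between two datetimes."""
    return (moment - start).total_seconds() / 3600.0


if __name__ == "__main__":
    sample = [
        "2024-02-18 04:52:40: Object Detection (YOLOv5 6.2): Detecting using ipcam-combined in Object Detection (YOLOv5 6.2)",
        "2024-02-18 04:52:40: Object Detection (YOLOv5 6.2):  [RuntimeError] : Traceback (most recent call last):",
    ]
    # Hypothetical service start time; substitute the real one from your own log.
    service_start = datetime(2024, 2, 17, 17, 0, 0)
    for moment in error_times(sample):
        print(f"{moment}: {hours_since(service_start, moment):.1f} h after start")
```

Collecting this figure for each crash would show whether the interval really is consistent, which is useful evidence to post alongside the tracebacks.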

For Chris M to put it down to model size is not the kind of support response I'd expect, especially from someone who is the co-founder of the project.

I'm going to write another response with my pattern findings and advise that, despite setting the model to small, it's still bombing out with the same errors.

The only workaround I can think of at present is to restart the service twice per day, which is pants in all fairness. It's just not sustainable, but there is nothing else I can do at this stage.

I'm well gutted to be honest... oh well, maybe I'll have to go back to the original setup, non-CPAI mode with false alerts LOL

Sardon · Post by **Sardon** » Thu Feb 22, 2024 10:31 am

Given the ongoing issues, I've decided to experiment with the .NET version of YOLO, which uses DirectML. I'm hoping this change might alleviate the crashes that have occurred every 10-12 hours with the CUDA variant.

It is baffling, though, that my setup mirrors Tim's (identical GPU, identical drivers, etc.) and yet mine buckles every 10-12 hours. Anyway, I've fired off a response over on the CPAI forum, we'll see...

Switching to the .NET variant is primarily for testing purposes. I'm a bit apprehensive about potential impacts on inference times, but we'll see. Here's to hoping it can make it through a day without faltering. Fingers crossed.

Catch you later with updates!
