6. Limitations
The following are known issues with the current release.
- To address a security vulnerability, profiling tools disable all features for
non-root and non-admin users. As a result, CUPTI cannot profile an application for these users when
using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue
and the solutions can be found on this
web page.
Note: Starting with CUDA 10.2, CUPTI allows tracing features for non-root and non-admin users on desktop platforms. Event and metric profiling is still restricted for non-root and non-admin users.
- Profiling results might be inconsistent when auto boost is enabled. The profiler tries to disable auto boost by default, but this can fail under some conditions; profiling then continues and the results will be inconsistent. The cuptiGetAutoBoostState() API can be used to query the auto boost state of the device. It returns the error CUPTI_ERROR_NOT_SUPPORTED on devices that don't support auto boost. Note that auto boost is supported only on certain Tesla devices with compute capability 3.0 and higher.
- CUPTI doesn't populate deprecated activity structures. Instead, the newer version of the activity structure is filled with the information.
- Because of the low resolution of the timer on Windows, the start and end timestamps can be the same for activities with a short execution duration.
- An application that calls CUPTI APIs cannot be used with NVIDIA tools such as nvprof, NVIDIA Visual Profiler, Nsight Compute, Nsight Systems, NVIDIA Nsight Visual Studio Edition, cuda-gdb and cuda-memcheck.
- PCIe and NVLink records are not captured when CUPTI is initialized lazily after CUDA initialization.
- CUPTI fails to profile an OpenACC application when the OpenACC library linked with the application is missing definitions of one or more OpenACC API routines. This is indicated by the error code CUPTI_ERROR_OPENACC_UNDEFINED_ROUTINE.
- OpenACC profiling might fail when the OpenACC library is linked statically into the user application. This happens because the definitions of the OpenACC API routines needed for OpenACC profiling are missing, as the compiler might omit definitions of functions that are not used in the application. This issue can be mitigated by linking the OpenACC library dynamically.
- Unified memory profiling is not supported on the ARM architecture.
- Profiling a C++ application which overloads the new operator at the global scope and uses any CUDA APIs like cudaMalloc() or cudaMallocManaged() inside the overloaded new operator will result in a hang.
- Devices with compute capability 6.0 and higher introduce a new feature, compute
preemption, which gives all compute contexts a fair chance to run while long tasks are executing.
Compute preemption can affect event and metric collection.
To keep compute preemption from affecting profiler results, try to isolate the context being profiled.
- Devices with compute capability 6.0 and higher support demand paging. When a kernel is scheduled for the first time, all pages allocated using cudaMallocManaged that are required for execution of the kernel are fetched into global memory as GPU faults are generated. The profiler requires multiple passes to collect all the metrics required for kernel analysis, and the kernel state needs to be saved and restored for each kernel replay pass. On devices with compute capability 6.0 and higher and on platforms supporting Unified Memory, the GPU faults are generated and all pages are fetched into global memory during the first kernel iteration; from the second iteration onwards, GPU page faults do not occur. This significantly affects memory-related events and timing. The time reported by tracing includes the time required to fetch the pages, but most of the metrics profiled over multiple iterations do not include the time or cycles required to fetch the pages. This causes inconsistency in the profiler results.
- When profiling an application that uses CUDA Dynamic Parallelism (CDP), there are several limitations to the profiling tools.
- Compiling the samples autorange_profiling and userrange_profiling requires a host compiler that supports C++11 features. Some g++ compilers require the flag -std=c++11 to turn on C++11 features.
- PC Sampling is not supported on Tegra platforms.
- As of CUDA 11.4 and the R470 TRD1 driver release, CUPTI is supported in a vGPU environment, which requires a vGPU license. If the license is not obtained after 20 minutes, the reported performance data, including metrics from the GPU, will be inaccurate. This is because of a feature in the vGPU environment which reduces performance but retains functionality as specified here.
- CUPTI is not supported on NVIDIA Crypto Mining Processors (CMP). This is reported using the error code CUPTI_ERROR_CMP_DEVICE_NOT_SUPPORTED. For more information, please visit the web page.
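The Windows timer limitation listed above means that code consuming activity records may see identical start and end timestamps for short activities. The sketch below shows one way a client could guard against a zero duration before deriving a rate from it. Both helpers are hypothetical and are not part of CUPTI: the names activityDurationNs and throughputBytesPerSec, the assumption that timestamps are in nanoseconds, and the -1.0 sentinel are all illustrative choices.

```cpp
#include <cstdint>

// Hypothetical helper (not a CUPTI API): elapsed nanoseconds between two
// activity timestamps. Returns 0 when the timer could not resolve the
// interval, i.e. when end is not later than start.
std::uint64_t activityDurationNs(std::uint64_t startNs, std::uint64_t endNs) {
    return endNs > startNs ? endNs - startNs : 0;
}

// Hypothetical helper (not a CUPTI API): bytes per second for an activity
// that moved `bytes` bytes. Returns -1.0 as an "unknown" sentinel when the
// duration is zero, instead of dividing by zero.
double throughputBytesPerSec(std::uint64_t bytes,
                             std::uint64_t startNs,
                             std::uint64_t endNs) {
    const std::uint64_t durationNs = activityDurationNs(startNs, endNs);
    if (durationNs == 0) {
        return -1.0;  // timestamps were identical; rate cannot be computed
    }
    return static_cast<double>(bytes) * 1e9 / static_cast<double>(durationNs);
}
```

A caller would pass the start and end fields of an activity record to these helpers and treat a negative throughput as "duration below timer resolution" rather than as a measured value.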
Profiling
Event and Metric API
Profiling and Perfworks Metric API