CUDA C Programming Guide
Page Count: 301
- Table of Contents
- List of Figures
- List of Tables
- Introduction
- Programming Model
- Programming Interface
- 3.1. Compilation with NVCC
- 3.2. CUDA C Runtime
- 3.2.1. Initialization
- 3.2.2. Device Memory
- 3.2.3. Shared Memory
- 3.2.4. Page-Locked Host Memory
- 3.2.5. Asynchronous Concurrent Execution
- 3.2.6. Multi-Device System
- 3.2.7. Unified Virtual Address Space
- 3.2.8. Interprocess Communication
- 3.2.9. Error Checking
- 3.2.10. Call Stack
- 3.2.11. Texture and Surface Memory
- 3.2.12. Graphics Interoperability
- 3.3. Versioning and Compatibility
- 3.4. Compute Modes
- 3.5. Mode Switches
- 3.6. Tesla Compute Cluster Mode for Windows
- Hardware Implementation
- Performance Guidelines
- CUDA-Enabled GPUs
- C Language Extensions
- B.1. Function Execution Space Specifiers
- B.2. Variable Memory Space Specifiers
- B.3. Built-in Vector Types
- B.4. Built-in Variables
- B.5. Memory Fence Functions
- B.6. Synchronization Functions
- B.7. Mathematical Functions
- B.8. Texture Functions
- B.8.1. Texture Object API
- B.8.1.1. tex1Dfetch()
- B.8.1.2. tex1D()
- B.8.1.3. tex1DLod()
- B.8.1.4. tex1DGrad()
- B.8.1.5. tex2D()
- B.8.1.6. tex2DLod()
- B.8.1.7. tex2DGrad()
- B.8.1.8. tex3D()
- B.8.1.9. tex3DLod()
- B.8.1.10. tex3DGrad()
- B.8.1.11. tex1DLayered()
- B.8.1.12. tex1DLayeredLod()
- B.8.1.13. tex1DLayeredGrad()
- B.8.1.14. tex2DLayered()
- B.8.1.15. tex2DLayeredLod()
- B.8.1.16. tex2DLayeredGrad()
- B.8.1.17. texCubemap()
- B.8.1.18. texCubemapLod()
- B.8.1.19. texCubemapLayered()
- B.8.1.20. texCubemapLayeredLod()
- B.8.1.21. tex2Dgather()
- B.8.2. Texture Reference API
- B.8.2.1. tex1Dfetch()
- B.8.2.2. tex1D()
- B.8.2.3. tex1DLod()
- B.8.2.4. tex1DGrad()
- B.8.2.5. tex2D()
- B.8.2.6. tex2DLod()
- B.8.2.7. tex2DGrad()
- B.8.2.8. tex3D()
- B.8.2.9. tex3DLod()
- B.8.2.10. tex3DGrad()
- B.8.2.11. tex1DLayered()
- B.8.2.12. tex1DLayeredLod()
- B.8.2.13. tex1DLayeredGrad()
- B.8.2.14. tex2DLayered()
- B.8.2.15. tex2DLayeredLod()
- B.8.2.16. tex2DLayeredGrad()
- B.8.2.17. texCubemap()
- B.8.2.18. texCubemapLod()
- B.8.2.19. texCubemapLayered()
- B.8.2.20. texCubemapLayeredLod()
- B.8.2.21. tex2Dgather()
- B.9. Surface Functions
- B.9.1. Surface Object API
- B.9.1.1. surf1Dread()
- B.9.1.2. surf1Dwrite()
- B.9.1.3. surf2Dread()
- B.9.1.4. surf2Dwrite()
- B.9.1.5. surf3Dread()
- B.9.1.6. surf3Dwrite()
- B.9.1.7. surf1DLayeredread()
- B.9.1.8. surf1DLayeredwrite()
- B.9.1.9. surf2DLayeredread()
- B.9.1.10. surf2DLayeredwrite()
- B.9.1.11. surfCubemapread()
- B.9.1.12. surfCubemapwrite()
- B.9.1.13. surfCubemapLayeredread()
- B.9.1.14. surfCubemapLayeredwrite()
- B.9.2. Surface Reference API
- B.9.2.1. surf1Dread()
- B.9.2.2. surf1Dwrite()
- B.9.2.3. surf2Dread()
- B.9.2.4. surf2Dwrite()
- B.9.2.5. surf3Dread()
- B.9.2.6. surf3Dwrite()
- B.9.2.7. surf1DLayeredread()
- B.9.2.8. surf1DLayeredwrite()
- B.9.2.9. surf2DLayeredread()
- B.9.2.10. surf2DLayeredwrite()
- B.9.2.11. surfCubemapread()
- B.9.2.12. surfCubemapwrite()
- B.9.2.13. surfCubemapLayeredread()
- B.9.2.14. surfCubemapLayeredwrite()
- B.10. Read-Only Data Cache Load Function
- B.11. Time Function
- B.12. Atomic Functions
- B.13. Warp Vote Functions
- B.14. Warp Match Functions
- B.15. Warp Shuffle Functions
- B.16. Warp Matrix Functions [PREVIEW FEATURE]
- B.17. Profiler Counter Function
- B.18. Assertion
- B.19. Formatted Output
- B.20. Dynamic Global Memory Allocation and Operations
- B.21. Execution Configuration
- B.22. Launch Bounds
- B.23. #pragma unroll
- B.24. SIMD Video Instructions
- Cooperative Groups
- CUDA Dynamic Parallelism
- D.1. Introduction
- D.2. Execution Environment and Memory Model
- D.3. Programming Interface
- D.3.1. CUDA C/C++ Reference
- D.3.2. Device-side Launch from PTX
- D.3.3. Toolkit Support for Dynamic Parallelism
- D.4. Programming Guidelines
- Mathematical Functions
- C/C++ Language Support
- F.1. C++11 Language Features
- F.2. C++14 Language Features
- F.3. Restrictions
- F.3.1. Host Compiler Extensions
- F.3.2. Preprocessor Symbols
- F.3.3. Qualifiers
- F.3.4. Pointers
- F.3.5. Operators
- F.3.6. Run Time Type Information (RTTI)
- F.3.7. Exception Handling
- F.3.8. Standard Library
- F.3.9. Functions
- F.3.10. Classes
- F.3.11. Templates
- F.3.12. Trigraphs and Digraphs
- F.3.13. Const-qualified variables
- F.3.14. Deprecation Annotation
- F.3.15. C++11 Features
- F.3.15.1. Lambda Expressions
- F.3.15.2. std::initializer_list
- F.3.15.3. Rvalue references
- F.3.15.4. Constexpr functions and function templates
- F.3.15.5. Constexpr variables
- F.3.15.6. Inline namespaces
- F.3.15.7. thread_local
- F.3.15.8. __global__ functions and function templates
- F.3.15.9. __device__/__constant__/__shared__ variables
- F.3.15.10. Defaulted functions
- F.3.16. C++14 Features
- F.4. Polymorphic Function Wrappers
- F.5. Experimental Feature: Extended Lambdas
- F.6. Code Samples
- Texture Fetching
- Compute Capabilities
- Driver API
- CUDA Environment Variables
- Unified Memory Programming
- K.1. Unified Memory Introduction
- K.2. Programming Model
- K.2.1. Managed Memory Opt In
- K.2.2. Coherency and Concurrency
- K.2.2.1. GPU Exclusive Access To Managed Memory
- K.2.2.2. Explicit Synchronization and Logical GPU Activity
- K.2.2.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- K.2.2.4. Stream Association Examples
- K.2.2.5. Stream Attach With Multithreaded Host Programs
- K.2.2.6. Advanced Topic: Modular Programs and Data Access Constraints
- K.2.2.7. Memcpy()/Memset() Behavior With Managed Memory
- K.2.3. Language Integration
- K.2.4. Querying Unified Memory Support
- K.2.5. Advanced Topics
- K.3. Performance Tuning