This documentation has been prepared to assist with the implementation of the Axolotl training and fine-tuning platform.
Environment Creation
The first step in the process is ensuring our machine is optimised to use the NVIDIA platform and GPUs.
Base Distribution: Ubuntu 20.04
We are going to use Ubuntu 20.04 as the base image because:
- This version is a long-term support (LTS) release
- Ubuntu is well supported by NVIDIA
- It works well as a remote instance, so we can access high-powered GPUs
- These remote environments allow all team members to access our platform
Integrated Development Environment (IDE)
We recommend using Visual Studio Code (VS Code) for our Integrated Development Environment
Support for Remote Development: VS Code offers remote development support, which is crucial for accessing and managing virtual machines with powerful GPUs
Integrated Terminal and Docker Support: The integrated terminal in VS Code enables direct interaction with command-line tools, essential for managing Docker containers and executing model training scripts.
Extensive Language Support: Large language model development often involves multiple programming languages (like Python, C++). VS Code supports a wide range of languages and their specific tooling, which is critical for such multifaceted development.
Version Control Integration: With built-in Git support, VS Code makes it easier to track and manage changes in code
Virtual Machine Requirements:
- Docker*
- CUDA (version 12.1)*: a parallel computing platform and programming model
- NVIDIA NGC: NVIDIA's container registry, providing access to NVIDIA Docker containers
- NVIDIA CUDA Toolkit*: the compiler and tooling for CUDA, translating CUDA code into executable programs
- GCC: the compiler required for development using the CUDA Toolkit
- GLIBC: the GNU Project's implementation of the C standard library. Includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.
*Please note, Continuum's base virtual machine installation script installs Docker, the NVIDIA Container Toolkit, and the CUDA 12.1 driver.
Check the virtual machine is ready
To ensure that the virtual machine is set up for the training and fine tuning of large language models, follow the instructions below:
Check the installation of the NVIDIA CUDA Toolkit
What is the CUDA Toolkit?
The NVIDIA CUDA Toolkit provides a development environment for creating high performance GPU-accelerated applications.
With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated systems.
The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.
We will be installing the NVIDIA CUDA Toolkit, version 12.1.
First, check whether the CUDA Toolkit is installed by checking for its core compiler, the NVIDIA CUDA Compiler (NVCC).
Nvidia CUDA Compiler (NVCC) is a part of the CUDA Toolkit. It is the compiler for CUDA, responsible for translating CUDA code into executable programs.
NVCC takes high-level CUDA code and turns it into a form that can be understood and executed by the GPU. It handles the partitioning of code into segments that can be run on either the CPU or GPU, and manages the compilation of the GPU parts of the code.
First, to check if NVCC is installed and its version, run
nvcc --version
If the NVIDIA CUDA Toolkit has been installed, this will be the output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
You should see release 12.1 - which indicates the CUDA Toolkit Version 12.1 has been successfully installed.
If NVCC is not installed, then go ahead and install the CUDA Toolkit 12.1.
The CUDA Toolkit download website is located here:
The web application at this site will ask you to define your installation setup.
For our base virtual machine this will be:
Variable            Parameter
Operating System    Linux
Architecture        x86_64 (64-bit)
Distribution        Ubuntu
Version             20.04
Installer Type      deb (local)
Detailed instructions and an explanation of the process are below:
Reference: Installation of CUDA Toolkit 12.1
Using the local installation approach
deb (local)
This installer is a Debian package file that contains all the necessary CUDA files.
It's a large file because it includes everything needed for the installation.
Step 1: Download the CUDA Repository Pin
Open your terminal and execute the following command. This will download a pin file for the CUDA repository:
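wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
The remaining steps follow NVIDIA's published deb (local) sequence for CUDA 12.1 on Ubuntu 20.04; the installer filename below is the one generated for the 12.1.0 release and may differ for point releases, so copy the exact commands from the download page.
Step 2: Download and Install the Local Repository Package
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2004-12-1-local_12.1.0-530.30.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-1-local_12.1.0-530.30.02-1_amd64.deb
Step 3: Install the Repository Signing Key
sudo cp /var/cuda-repo-ubuntu2004-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
Step 4: Update the Apt Cache and Install the Toolkit
sudo apt-get update
sudo apt-get -y install cuda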
Once installation is complete, verify that NVCC is installed and check its version by running:
nvcc --version
If the NVIDIA CUDA Toolkit has been installed, this will be the output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
If NVCC points to an older version of CUDA despite upgrading to 12.1, follow the instructions below. CUDA must be on your PATH!
Make sure CUDA 12.1 is on PATH
If you've followed the CUDA installation instructions and still find that an old version of NVCC (NVIDIA CUDA Compiler) is installed, there are several steps you can take to troubleshoot and resolve the issue:
Verify the Installation:
Ensure that CUDA was installed correctly without errors. You can check the installation logs for any errors or warnings that might have occurred during the installation process.
To list all installed CUDA-related packages and their versions, enter the following into the terminal:
dpkg -l | grep cuda
This will provide a list of all CUDA-related packages. Below are some examples of the packages this command highlights:
cuda, cuda-12-1: These are meta-packages for CUDA version 12.1. Installing a meta-package installs all the components of CUDA.
cuda-cccl-12-1: CUDA CCCL (CUDA C++ Core Library) is part of the CUDA Toolkit and provides essential libraries for CUDA C++ development.
cuda-command-line-tools-12-1: This package includes command-line tools for CUDA, such as nvcc (NVIDIA CUDA Compiler), which is crucial for compiling CUDA code.
cuda-compiler-12-1: This package includes the CUDA compiler, which is essential for converting CUDA code into code that can run on Nvidia GPUs.
cuda-demo-suite-12-1: This package contains demos showcasing the capabilities and features of CUDA.
cuda-documentation-12-1: Provides the documentation for CUDA, useful for developers to understand and use CUDA APIs.
cuda-driver-dev-12-1: Includes development resources for the CUDA driver, such as headers and stub libraries.
cuda-libraries-12-1, cuda-libraries-dev-12-1: These meta-packages include libraries necessary for CUDA development and their development counterparts.
cuda-nvcc-12-1: NVIDIA CUDA Compiler (NVCC) is a tool for compiling CUDA code.
cuda-repo-ubuntu2004-12-1-local: Contains repository configuration files for the CUDA toolkit.
Check Environment Variables:
Ensure that your environment variables are pointing to the new CUDA installation. Specifically, check the PATH and LD_LIBRARY_PATH environment variables via the following commands:
echo $PATH
echo $LD_LIBRARY_PATH
Update these variables if they are pointing to an older CUDA version.
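Assuming the toolkit was installed to the default /usr/local/cuda-12.1 location, the standard fix is to export the updated paths, for example in your ~/.bashrc:
export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}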
This command adds /usr/local/cuda-12.1/bin to the PATH environment variable.
PATH is a list of directories the shell searches for executable files. Adding CUDA's bin directory makes it possible to run CUDA tools and compilers directly from the command line without specifying their full path.
${PATH:+:${PATH}} is a shell parameter expansion pattern that appends the existing PATH variable to the new path. If PATH is unset, nothing is appended.
LD_LIBRARY_PATH is an environment variable specifying directories where libraries are searched for first, before the standard set of directories.
This command adds the lib64 directory of the CUDA installation to LD_LIBRARY_PATH, which is necessary for 64-bit systems. It ensures that the system can find and use the CUDA libraries.
Verify NVCC Version:
After updating the environment variables, check the NVCC version again using:
nvcc --version
This should reflect the new version if the environment variables are set correctly.
Update Alternatives:
Sometimes, multiple versions of CUDA can coexist, and the system may still use the old version. Use update-alternatives to configure the default CUDA version.
Run the following and select the correct version of NVCC:
sudo update-alternatives --config nvcc
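If nvcc has not yet been registered with the alternatives system, register each installed version first. A sketch, assuming toolkits live under /usr/local (the 11.8 path and the priorities are illustrative):
sudo update-alternatives --install /usr/bin/nvcc nvcc /usr/local/cuda-12.1/bin/nvcc 121
sudo update-alternatives --install /usr/bin/nvcc nvcc /usr/local/cuda-11.8/bin/nvcc 118
sudo update-alternatives --config nvcc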
Reference: What is included in the CUDA Toolkit?
CUDA contains components and tools designed to facilitate development, optimisation, and deployment of GPU-accelerated applications.
These components cater to various aspects of GPU computing, from low-level programming to high-level library support.
Here's an overview of the contents of the CUDA 12.1 toolkit:
NVCC (NVIDIA CUDA Compiler): NVCC is the primary compiler that converts CUDA code into binary executable form.
CUDA Libraries: CUDA 12.1 includes a range of GPU-accelerated libraries for different types of computing and data processing tasks. These include:
cuBLAS: A GPU-accelerated implementation of the Basic Linear Algebra Subprograms (BLAS) library.
cuFFT: A library for performing Fast Fourier Transforms (FFT) on the GPU.
cuRAND: A library for generating random numbers on the GPU.
cuDNN: A GPU-accelerated library for deep neural networks (distributed by NVIDIA as a separate download, commonly used alongside the toolkit).
NVIDIA NPP (NVIDIA Performance Primitives): A collection of GPU-accelerated image, video, and signal processing functions.
Thrust: A C++ parallel programming library resembling the C++ Standard Library.
CUDA Runtime and Driver APIs: These APIs allow applications to manage devices, memory, and program executions on GPUs. The runtime API is designed for high-level programming, while the driver API provides lower-level control.
CUDA Debugger (cuda-gdb): A tool to debug CUDA applications running on the GPU.
CUDA Profiler (nvprof and Nsight Systems): Tools for profiling the performance of CUDA applications. These profilers help in identifying performance bottlenecks.
Nsight Eclipse Edition: An integrated development environment (IDE) for developing and debugging CUDA applications on Linux and macOS.
CUDA Samples and Documentation: A set of sample code and comprehensive documentation to help developers understand and use various features of the CUDA toolkit.
CUDA Toolkit SDK: Contains additional libraries, code samples, and resources for software development.
Support for Various GPU Architectures: Each CUDA version supports specific Nvidia GPU architectures. CUDA 12.1 would include support for the latest architectures available at its release.
Multi-Language Support: While CUDA is primarily used with C and C++, it also supports other languages like Python (via libraries like PyCUDA).
GPU-accelerated Machine Learning and AI Libraries: Integration with libraries and frameworks for AI and machine learning, such as TensorFlow and PyTorch, which can leverage CUDA for GPU acceleration.
CUDA Reference Materials
A summary of the NVIDIA CUDA Installation Guide for Linux follows.
NVIDIA CUDA Installation Guide for Linux
The NVIDIA CUDA Installation Guide for Linux outlines the process for installing the CUDA Toolkit on a Linux system.
System Requirements:
Lists the prerequisites for using NVIDIA CUDA, including a CUDA-capable GPU, a supported version of Linux with gcc compiler and toolchain, and the CUDA Toolkit.
Provides information about compatible Linux distributions and kernel versions, including links to resources for specific Linux distributions like RHEL, SLES, Ubuntu LTS, and L4T.
OS Support Policy:
Details the support policy for various Linux distributions like Ubuntu, RHEL, CentOS, Rocky Linux, SUSE SLES, OpenSUSE Leap, Debian, Fedora, and KylinOS, including their end-of-support life (EOSS) timelines.
Host Compiler Support Policy:
Describes the supported compilers for different architectures (x86_64, Arm64 sbsa, POWER 9), including GCC, Clang, NVHPC, XLC, ArmC/C++, and ICC.
Emphasizes the importance of using a compatible host compiler for compiling CPU host code in CUDA sources.
Pre-installation Actions:
Outlines steps to prepare for CUDA Toolkit and Driver installation, including verifying the presence of a CUDA-capable GPU, a supported version of Linux, gcc installation, correct kernel headers and development packages, and considering conflicting installation methods.
Notes the possibility of overriding install-time prerequisite checks using the -override flag.
Verification Steps:
Provides commands to verify the presence of a CUDA-capable GPU (lspci | grep -i nvidia), the supported version of Linux (uname -m && cat /etc/*release), and the gcc installation (gcc --version).
Explains how to check for the correct kernel headers and development packages.
Ubuntu-Specific Installation Instructions:
Details steps to prepare Ubuntu for CUDA installation, including installing kernel headers and removing outdated signing keys.
Describes two installation methods for Ubuntu: local repo and network repo, including steps for each method.
For both methods, provides common instructions to update the Apt repository cache and install the CUDA SDK, and includes a reminder to reboot the system after installation.
If you are interested, you can familiarise yourself with CUDA Best Practices:
Below is a summary of CUDA Programming
CUDA Tutorial - CUDA Programming is difficult!
CUDA's Role in Machine Learning
CUDA, NVIDIA's parallel computing platform, is particularly important in machine learning because many machine learning tasks involve linear algebra, like matrix multiplications and vector operations. CUDA is optimised for these kinds of operations, offering significant performance improvements over traditional CPU processing.
Installation and Setup Requirements
To use CUDA, you need to install the CUDA toolkit and have an NVIDIA GPU, as CUDA does not work with other types of GPUs like AMD. The setup process varies depending on your operating system (Linux, Windows, or potentially Mac).
Prerequisite Knowledge
A basic understanding of C or C++ programming is necessary to work with CUDA. Concepts like memory allocation (malloc) and freeing memory are used without detailed explanation. If you're only familiar with Python, you might find following the CUDA examples challenging.
CUDA Programming Basics
CUDA code is similar to C/C++ but includes additional functions and data types for parallel computing. Understanding the transition from C to CUDA is critical for grasping how parallelisation is implemented in CUDA.
Understand CUDA's Grid and Block Model
The explanation of CUDA's grid and block model is crucial. It's important to understand how to configure blocks and threads within a grid to effectively parallelize tasks. When defining grid and block dimensions, remember that they directly influence the number of threads that will execute your code and how they are organized. Incorrect configurations can lead to inefficient use of GPU resources or even cause your program to behave unexpectedly.
Memory Management in CUDA
In CUDA, memory allocation and management are crucial, especially since you're dealing with both host (CPU) and device (GPU) memory. Incorrect memory handling can lead to crashes or incorrect program results.
Mapping Problem Domain to CUDA's Architecture
The idea of mapping a matrix's shape to CUDA's grid shape illustrates a key concept in CUDA programming: effectively mapping your problem domain to CUDA's architecture. This can be challenging, as it requires a good understanding of both your application's requirements and CUDA's parallel execution model.
Performance Considerations
One of the objectives of using CUDA is to enhance performance. It's important to note that not all problems will see a dramatic performance increase with CUDA, and sometimes the overhead of managing GPU resources can outweigh the benefits for smaller or less complex tasks.
Initialisation and Memory Representation
It's important to understand how multi-dimensional data structures like matrices are represented in memory, especially since CUDA typically deals with flattened arrays. This understanding is crucial for correctly indexing elements during calculations.
Understanding the Mapping of Computation to CUDA Threads
Each thread in CUDA is assigned a specific part of the computation, like a segment of a matrix-vector multiplication. This mapping is critical for efficient parallel computation. It's important to correctly calculate the indices each thread will work on, taking into account both block and thread indices.
Boundary Conditions in CUDA Computations
When working with CUDA, it’s important to handle boundary conditions carefully. If the dimensions of your data do not exactly match the grid and block dimensions you've configured, you must ensure that your code correctly handles these edge cases to avoid out-of-bounds memory access, which can lead to incorrect results or crashes.
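A minimal sketch of such a guard (the kernel and array sizes are our own illustration, not from the tutorial): the array length is deliberately not a multiple of the block size, so the final block contains threads that must do nothing.
#include <stdio.h>
#include <cuda_runtime.h>

// Doubles each element; the bounds check keeps threads whose global
// index falls past the end of the array from touching memory.
__global__ void scale(int n, float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // boundary condition: prevents out-of-bounds access
        data[i] *= 2.0f;
}

int main(void) {
    int n = 1000;   // deliberately not a multiple of the block size
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; i++) data[i] = 1.0f;

    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;   // 4 blocks = 1024 threads
    scale<<<numBlocks, blockSize>>>(n, data);
    cudaDeviceSynchronize();

    printf("data[%d] = %f\n", n - 1, data[n - 1]);   // expect 2.000000
    cudaFree(data);
    return 0;
}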
Host-Device Memory Management
The tutorial highlights the complexity of managing memory between the host (CPU) and the device (GPU). Data must be explicitly allocated and transferred between the host and device memories. This process adds an extra layer of complexity to CUDA programming and requires careful handling to ensure data integrity and to avoid memory leaks.
Grid and Block Dimension Calculations
Calculating the dimensions of grids and blocks is a critical aspect of CUDA programming. The dimensions influence how many threads are launched and how they are organised. Misconfigurations here can lead to inefficient use of GPU resources or even failure to execute the program correctly.
CUDA - Version Control
Managing CUDA versions and resolving conflicts can be critical.
Understanding CUDA and its Interaction with Libraries
Compatibility:
CUDA and GPU Compatibility: Each GPU model supports certain CUDA versions. Ensure your GPU is compatible with the CUDA version you plan to use.
Library Dependencies: Machine learning libraries like TensorFlow, PyTorch, etc., are built against specific CUDA and cuDNN versions. Using incompatible versions can lead to errors.
CUDA Toolkit vs. CUDA Runtime:
CUDA Toolkit: Includes the CUDA runtime and development environment (compiler, debugger, etc.). Needed for compiling CUDA-enabled applications.
CUDA Runtime: Required to run CUDA-enabled applications. Often, libraries like PyTorch bundle the necessary CUDA runtime components, so you might not need a system-wide CUDA installation.
Continuum - managing CUDA Versions and Conflicts
Isolate Environments:
Use separate Python environments, via Anaconda, for different projects. This isolation allows you to install different versions of libraries (and their corresponding CUDA versions) without conflict.
Example: Have one Conda environment for TensorFlow with CUDA 11.0 and another for PyTorch with CUDA 10.2.
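A sketch of that isolation in practice (environment name and version pins are illustrative; this mirrors PyTorch's historical Conda install command):
conda create -n torch-cu102 python=3.9
conda activate torch-cu102
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch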
Containerization:
Use Docker or similar containerisation tools. They encapsulate the entire runtime environment, including the specific versions of CUDA, cuDNN, and other dependencies.
Nvidia provides Docker images (NGC Containers) with TensorFlow, PyTorch, etc., pre-installed with the required CUDA and cuDNN versions.
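For example, pulling and running an NGC PyTorch image with GPU access (the tag is illustrative; check the NGC catalogue for current tags):
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3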
Local CUDA Installation:
If you need to compile CUDA code, install the CUDA Toolkit locally.
Be mindful of the CUDA version if you have multiple projects requiring different versions.
Check Compatibility Before Installation:
Before installing a deep learning library, check its compatibility with your CUDA version.
Use the installation commands tailored to specific CUDA versions (as seen with PyTorch).
Check the installation of the NVIDIA Container Toolkit
The NVIDIA Container Toolkit enables users to build and run GPU-accelerated containers.
The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.
This allows you to use NVIDIA containers in Docker.
Background and Explanation
The NVIDIA Container Toolkit is designed to integrate NVIDIA GPUs into containerised applications. It's compatible with various container runtimes and consists of several components:
NVIDIA Container Runtime (nvidia-container-runtime): An OCI-compliant runtime for Docker or containerd, enabling the use of NVIDIA GPUs in containers.
NVIDIA Container Runtime Hook (nvidia-container-toolkit / nvidia-container-runtime-hook): A component executing prestart scripts to configure GPU access in containers.
NVIDIA Container Library and CLI (libnvidia-container1, nvidia-container-cli): These provide a library and CLI for automatically configuring containers with NVIDIA GPU support, independent of the container runtime.
The toolkit's architecture allows for integration with various container runtimes like Docker, containerd, cri-o, and lxc. Notably, the NVIDIA Container Runtime is not required for cri-o and lxc.
The toolkit comprises main packages: nvidia-container-toolkit, nvidia-container-toolkit-base, libnvidia-container-tools, and libnvidia-container1, with specific dependencies between them.
Older packages like nvidia-docker2 and nvidia-container-runtime are now deprecated and merged into the nvidia-container-toolkit.
The Architecture
Key functionalities
NVIDIA Container Runtime Hook: Implements a runC prestart hook, configuring GPU devices in containers based on the container's config.json.
NVIDIA Container Runtime: A wrapper around runC, modifying the OCI runtime spec for GPU support.
NVIDIA Container Toolkit CLI: Offers utilities for configuring runtimes and generating Container Device Interface (CDI) specifications.
For installation, the nvidia-container-toolkit package is generally sufficient.
The toolkit's packages are available on GitHub, useful for both online and air-gapped installations. The repository also hosts experimental releases of the software.
Check for Docker
Before installing, ensure you have Docker installed:
docker --version
The output should be similar to:
Docker version 26.0.2, build 3c863ff
To install the NVIDIA Container Toolkit, follow the instructions below:
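A sketch following NVIDIA's published Apt instructions (check the NVIDIA Container Toolkit documentation for the current repository details before running these):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then verify GPU access from inside a container; you should see nvidia-smi output like the following:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi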
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:00:05.0 Off | 0 |
| N/A 35C P0 55W / 400W| 5MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Compatibility Testing
The CUDA development environment relies on tight integration with the host development environment, including the host compiler and C runtime libraries, and is therefore only supported on Ubuntu versions that have been qualified for the CUDA Toolkit release.
Now that we have installed the NVIDIA CUDA Toolkit and the NVIDIA Container Toolkit, we need to ensure our virtual machine is compatible with these installations.
Compatibility is critical
The material below provides instructions on how to ensure the NVIDIA Drivers are compatible with the host system.
With the NVIDIA CUDA Toolkit installed, we need to ensure the host machine is compatible with this Toolkit.
Compatibility between CUDA 12.1 and the host development environment
This table lists the kernel versions, default GCC (GNU Compiler Collection) versions, and GLIBC (GNU C Library) versions for two different LTS (Long-Term Support) releases of Ubuntu.
Distribution        Kernel        Default GCC   GLIBC
Ubuntu 22.04 LTS    5.15.0-43     11.2.0        2.35
Ubuntu 20.04 LTS    5.13.0-46     9.3.0         2.31
Check the Kernel compatibility
To check the kernel version of your Ubuntu 20.04 system, you can use the uname command in the terminal. The uname command with different options provides various system information, including the kernel version. Here's how you can do it:
Run the uname command to get the kernel version by typing the following command and press Enter:
uname -r
The output should be this on a typical Ubuntu WSL2 distribution:
5.15.133.1-microsoft-standard-WSL2
or this on a typical Ubuntu 20.04 virtual machine
5.4.0-167-generic
As you can see, the first kernel is 5.15.133.1 and the second is 5.4.0-167. Neither needs to match the table exactly: the table lists the kernel version NVIDIA qualified for each release (5.13.0-46 for Ubuntu 20.04), and both the stock Ubuntu 20.04 kernel and the newer WSL2 kernel shown here work with the CUDA Toolkit in practice, although only the qualified versions are formally supported.
What is a kernel?
A kernel is the core component of an operating system (OS).
It acts as a bridge between applications and the actual data processing done at the hardware level.
The kernel's responsibilities include managing the system's resources and allowing multiple programs to run and use these resources efficiently. Here are some key aspects of a kernel:
Resource Management
The kernel manages hardware resources like the CPU, memory, and disk space. It allocates resources to various processes, ensuring that each process receives enough resources to function effectively while maintaining overall system efficiency.
Process Management
It handles the creation, scheduling, and termination of processes. The kernel decides which processes should run when and for how long, a process known as scheduling. This is critical in multi-tasking environments where multiple processes require CPU attention.
Memory Management
The kernel controls how memory is allocated to various processes and manages memory access, ensuring that each process has access to the memory it needs without interfering with other processes. It also manages virtual memory, allowing the system to use disk space as an extension of RAM.
Device Management
It acts as an intermediary between the hardware and software of a computer. For instance, when a program needs to read a file from a disk, it requests this service from the kernel, which then communicates with the disk drive’s hardware to read the data.
Security and Access Control
The kernel enforces access control policies, preventing unauthorised access to the system and its resources. It manages user permissions and ensures that processes have the required privileges to execute their tasks.
System Calls
These are the mechanisms through which user-space applications interact with the kernel. For example, when an application needs to open a file, it makes a system call, which is handled by the kernel.
Types of Kernels
Monolithic Kernels: These kernels include various services like the filesystem, device drivers, network interfaces, etc., within one large kernel. Example: Linux.
Microkernels: These kernels focus on minimal functionality, providing only basic services like process and memory management. Other components like device drivers are run in user space. Example: Minix.
Hybrid Kernels: These are a mix of monolithic and microkernel architectures. Example: Windows NT kernel.
Examples of Kernels
Linux Kernel: Used in Linux distributions.
Windows NT Kernel: Used in various versions of Microsoft Windows.
XNU Kernel: Used in macOS and iOS.
Check GNU Compiler Compatibility
NVIDIA CUDA Libraries work in conjunction with GCC (GNU Compiler Collection) on Linux systems.
GCC is commonly used for compiling the host (CPU) part of the code, while CUDA tools like nvcc (NVIDIA CUDA Compiler) are used for compiling the device (GPU) part of the code.
The CUDA Toolkit includes wrappers and libraries that facilitate the integration between the CPU and GPU parts of the code.
NVIDIA provides compatibility information for specific versions of GCC, especially on Linux systems where GCC is a common choice for compiling the host code.
The CUDA runtime libraries, which are installed separately, are sufficient for running CUDA applications on systems with compatible NVIDIA GPUs.
The gcc compiler is required for development using the CUDA Toolkit.
To reiterate - when developing applications that use both CPU and GPU, developers might use GCC for compiling the CPU part of the code, while CUDA tools (like nvcc - NVIDIA CUDA Compiler) are used for compiling the GPU part.
The CUDA toolkit often includes compatibility information with specific versions of GCC, especially on Linux systems, where GCC is a common choice for compiling the host code.
Run the following command to check the installed version of GCC:
gcc --version
The first line of the output will show the version number. Ensure it is at least the default GCC version listed in the table above for your Ubuntu version.
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0   <-- 9.4.0 is the GCC version number
If you do not have GCC installed, execute the following:
Installation of GCC via installing 'build essentials'
The build-essential meta-package in Ubuntu is a collection of tools and packages needed for compiling and building software.
This package is particularly useful for developers and those compiling software from source. Here's a detailed summary of each package included in build-essential and information about where these packages are typically stored on Ubuntu 20.04:
dpkg-dev
Purpose: This package is a collection of development tools required to handle Debian (.deb) packages. It includes utilities to unpack, build, and upload Debian source packages, making it an essential tool for packaging software for Debian-based systems like Ubuntu.
Storage Location: The tools and scripts from dpkg-dev are usually stored in /usr/bin/ and /usr/share/dpkg/.
make
Purpose: make is a build automation tool that automatically builds executable programs and libraries from source code by reading files called Makefiles. It's crucial for compiling large programs, where it manages dependencies and only recompiles parts of the program that have changed.
Storage Location: The make executable is typically found in /usr/bin/make.
libc6-dev
Purpose: This package contains the development libraries and header files for the GNU C Library. It's essential for compiling C and C++ programs, as it includes standard libraries and headers.
Storage Location: The headers and libraries are generally located in /usr/include/ and /usr/lib/ respectively.
gcc/g++
Purpose: These are the GNU Compiler Collection for C and C++ languages. gcc is for compiling C programs, while g++ is used for C++ programs. They are fundamental for software development in these languages.
Storage Location: The compilers are usually found in /usr/bin/.
When you install the build-essential package on Ubuntu, it automatically installs these components and their dependencies. This package streamlines the setup process for a development environment by bundling these critical tools together.
To install build-essential on Ubuntu 20.04, you can use the following command in the terminal:
sudo apt update
sudo apt install build-essential
This command will download and install the build-essential package along with its dependencies. The packages are typically stored in the locations mentioned above, following the standard file system hierarchy of Linux systems. This structure helps in maintaining a standardized path for binaries, libraries, and other files, making it easier for users and other software to locate them.
Post installation of build essentials, check the GCC version you have:
gcc --version
The output should now prove you have GCC installed:
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2)
This version (9.4) should work with CUDA 12.1, which expects at least the 9.3 toolchain listed for Ubuntu 20.04. GCC minor releases are backward compatible, so 9.4 should be fine.
What is GCC and why is the version important?
GCC is a collection of compilers for various programming languages.
Although it started primarily for C (hence the original name GNU C Compiler), it now supports C++, Objective-C, Fortran, Ada, Go, and D.
Cross-Platform Compatibility
GCC can be used on many different types of operating systems and hardware architectures. This cross-platform capability makes it a versatile tool for developers who work in diverse environments.
Optimization and Portability
GCC offers a wide range of options for code optimization, making it possible to tune performance for specific hardware or application requirements. It also emphasizes portability, enabling developers to compile their code on one machine and run it on another without modification.
Standard Compliance
GCC strives to adhere closely to various programming language standards, including those for C and C++. This compliance ensures that code written and compiled with GCC is compatible with other compilers following the same standards.
Debugging and Error Reporting
GCC is known for its helpful debugging features and detailed error reporting, which are invaluable for developers in identifying and fixing code issues.
Integration with Development Tools
GCC easily integrates with various development tools and environments. It's commonly used in combination with IDEs, debuggers, and other tools, forming a complete development ecosystem.
Check GLIBC Compatibility
The GNU C Library, commonly known as glibc, is an important component of GNU systems and Linux distributions.
GLIBC is the GNU Project's implementation of the C standard library. It provides the system's core libraries. This includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.
To check the GLIBC version:
ldd --version
The first line of the output will show the version number:
ldd (Ubuntu GLIBC 2.31-0ubuntu9.12) 2.31   <-- 2.31 is the GLIBC version
Compare this with the GLIBC version in the table above.
The GLIBC version of 2.31 matches the version listed for Ubuntu 20.04, which is what the NVIDIA CUDA Toolkit expects.
What is GLIBC?
Definition: GLIBC is the GNU Project's implementation of the C standard library. Despite its name, it now also directly supports C++ (and indirectly other programming languages).
Purpose: It provides the system's core libraries. This includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.
Compatibility: It's designed to be compatible with the POSIX standard, the Single UNIX Specification, and several other open standards, while also extending them in various ways.
System Calls and Kernel: glibc serves as a wrapper for system calls to the Linux kernel and other essential functions. This means that most applications on a Linux system depend on glibc to interact with the underlying kernel.
Portability: It's used in systems that range from embedded systems to servers and supercomputers, providing a consistent and reliable base across various hardware architectures.
Checking GLIBC Version
To check the version of glibc on a Linux system, you can use the ldd command, which prints the shared library dependencies. The version of glibc will be displayed as part of this output. Here's how to do it:
Run the Command: Type the following command and press Enter:
ldd --version
The first line of the output will typically show the glibc version. For example, it might say ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31, where "2.31" is the version of glibc.
Importance in Development
Compatibility: When developing software for Linux, it's crucial to know the version of glibc your application will be running against, as different versions may have different features and behaviors.
Portability: For applications intended to run on multiple Linux distributions, understanding glibc compatibility is key to ensuring broad compatibility.
System-Level Programming: For low-level system programming, knowledge of glibc is essential as it provides the interface to many kernel-level services and system resources.
Debugging: Understanding glibc can be crucial for debugging, especially for complex applications that perform a lot of system-level operations.
With the NVIDIA CUDA Toolkit's host compatibility requirements reviewed, the next step is to check that the installations have been successful.
Process for checking installations have been successful
First, check your Ubuntu version. Ensure it matches Ubuntu 20.04, which is our designated Linux operating system
lsb_release -a
Then, verify that your system is based on the x86_64 architecture. Run:
uname -m
The output should be:
x86_64
To check if your system has a CUDA-capable NVIDIA GPU, run:
nvidia-smi
You should see an output like this, which details the NVIDIA Drivers installed and the CUDA Version.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:00:05.0 Off | 0 |
| N/A 36C P0 56W / 400W| 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1314 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
If this output is not visible, we must install the NVIDIA drivers.
A full analysis
To do this all at once...
If you would like a full printout of your system features, enter this command into the terminal:
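The exact command was not captured in this document; a one-liner along these lines gathers every item analysed below:
uname -m && uname -r && uname -v && hostname && uname -o && cat /proc/version && grep -m1 "model name" /proc/cpuinfo && grep MemTotal /proc/meminfo && lsb_release -a && nvcc --version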
The output from the terminal will provide you with all the information necessary to check system compatibility.
Typical analysis of the output from an A100 80GB instance
Machine Architecture: x86_64
Your system uses the 64-bit version of the x86 architecture. This is a standard architecture for modern desktops and servers, supporting more memory and larger data sizes compared to 32-bit systems.
Kernel Details
Kernel Name: Linux, indicating that your operating system is based on the Linux kernel.
Kernel Release: 5.4.0-167-generic. This specifies the version of the Linux kernel you are running. 'Generic' here implies a standard kernel version that is versatile for various hardware setups.
Kernel Version: #184-Ubuntu SMP. This shows a specific build of the kernel, compiled with Symmetric Multi-Processing (SMP) support, allowing efficient use of multi-core processors. The timestamp shows the build date.
Hostname: ps1rgbvhl
This is the network identifier for your machine, used to distinguish it in a network environment.
Operating System: GNU/Linux
This indicates that you're using a GNU/Linux distribution, a combination of the Linux kernel with GNU software.
Detailed Kernel Version
This reiterates your kernel version and build details. It also mentions the GCC version used for building the kernel (9.4.0), which affects compatibility with certain software.
CPU Information: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
The system is powered by an Intel Xeon Gold 6342 processor, which is a high-performance, server-grade CPU. The 2.80 GHz frequency indicates its base clock speed.
Memory Information: MemTotal: 92679772 kB
The system has a substantial amount of RAM (approximately 92.68 GB). This is a significant size, suitable for memory-intensive applications and multitasking.
Ubuntu Distribution Information
Distributor ID: Ubuntu. This shows the Linux distribution you're using.
Description: Ubuntu 20.04.6 LTS, indicating the specific version and that it's a Long-Term Support (LTS) release.
Release: 20.04, the version number.
Codename: focal, the internal codename for this Ubuntu release.
NVCC Version
The output details the version of the NVIDIA CUDA Compiler (NVCC) as 12.1, built in February 2023. NVCC is a key component for compiling CUDA code, essential for developing applications that leverage NVIDIA GPUs for parallel processing tasks.
In summary, the output paints a picture of a powerful, 64-bit Linux system with a high-performance CPU and a significant amount of RAM, running an LTS version of Ubuntu.
The presence of the NVCC with CUDA version 12.1 indicates readiness for CUDA-based development, particularly in fields like data science, machine learning, or any computationally intensive tasks that can benefit from GPU acceleration.
Installation of .NET SDK - required for Polyglot Notebooks
Installation of .NET
.NET is a free, open-source, and cross-platform framework developed by Microsoft.
It is used for building various types of applications, including web applications, desktop applications, cloud-based services, and more. .NET provides a rich set of libraries and tools for developers to create robust and scalable software solutions.
Add the Microsoft package repository
Installing with APT can be done with a few commands. Before you install .NET, run the following commands to add the Microsoft package signing key to your list of trusted keys and add the package repository.
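For Ubuntu 20.04 these are typically (per Microsoft's published instructions; the URL encodes the Ubuntu version):
wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb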
The .NET SDK allows you to develop apps with .NET. If you install the .NET SDK, you don't need to install the corresponding runtime. To install the .NET SDK, run the following commands:
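Assuming .NET 8, the release referenced below:
sudo apt-get update
sudo apt-get install -y dotnet-sdk-8.0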
The ASP.NET Core Runtime allows you to run apps that were made with .NET that didn't provide the runtime. The following commands install the ASP.NET Core Runtime, which is the most compatible runtime for .NET. In your terminal, run the following commands:
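Again assuming version 8.0, matching the package named below:
sudo apt-get update
sudo apt-get install -y aspnetcore-runtime-8.0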
As an alternative to the ASP.NET Core Runtime, you can install the .NET Runtime, which doesn't include ASP.NET Core support: replace aspnetcore-runtime-8.0 in the previous command with dotnet-runtime-8.0:
sudo apt-get install -y dotnet-runtime-8.0
If you want to change the GCC Version in your environment
You can change your GCC version in a Conda environment.
Here's how you can change the GCC version in a Conda environment:
Create a New Conda Environment (Optional)
If you don't already have a specific environment for your CUDA work, create one:
conda create -n axolotl python=3.10
Activate the Conda Environment:
conda activate axolotl
Install a Specific GCC Version:
conda install gcc_linux-64=gcc_version
Replace gcc_version with the version of GCC you need, for example, 9.4.0.
Verify GCC Version:
gcc --version
Install CUDA Toolkit (if needed):
If you haven't installed CUDA in your environment, you can do so using Conda (if available) or follow the CUDA Toolkit's installation guide:
conda install cudatoolkit=x.x
Replace x.x with the version of the CUDA Toolkit you need.
If you want to change the version of CUDA being used in your environment
The Conda installation for CUDA is an efficient way to install and manage the CUDA Toolkit, especially when working with Python environments.
Conda Overview
Conda can facilitate the installation of the CUDA Toolkit.
Installing CUDA Using Conda
Basic installation command: conda install cuda -c nvidia.
This command installs all components of the CUDA Toolkit.
Uninstalling CUDA Using Conda
Uninstallation command: conda remove cuda.
It removes the CUDA Toolkit installed via Conda.
Special Tip: After uninstallation, check for any residual files or dependencies that might need manual removal.
Installing Previous CUDA Releases
Install specific versions using: conda install cuda -c nvidia/label/cuda-<version>.
Replace <version> with the desired CUDA version (e.g., 11.3.0).
Special Tip: Installing previous versions can be crucial for compatibility with certain applications or libraries. Always check version compatibility with your project requirements.
Practical Example: Installing CUDA Toolkit:
Create a virtual environment
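For example, reusing the environment name from the earlier section (illustrative):
conda create -n axolotl python=3.10
conda activate axolotl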
conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc: Installs the NVIDIA CUDA Compiler (nvcc) from the specified NVIDIA channel on Conda. This is aligned with the CUDA version 11.8.0, ensuring compatibility with the specific version of PyTorch being used.
Additional Tools Installation (Optional):
conda install -c anaconda cmake: Installs CMake, a cross-platform tool for managing the build process of software using a compiler-independent method.
conda install -c conda-forge lit: Installs 'lit', a tool for executing LLVM's integrated test suites.
Installing PyTorch and Related Libraries
The pip install command is used to install specific versions of PyTorch (torch), along with its sister libraries torchvision and torchaudio. The --index-url flag points to the PyTorch wheel index for CUDA 11.8, ensuring that the installed PyTorch version is compatible with CUDA 11.8.
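A sketch (the version pins are illustrative; pin whatever your project requires):
pip3 install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118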
These commands add a new PPA (Personal Package Archive) for Ubuntu toolchain tests and install GCC 11 and G++ 11. These are needed for building certain components that require C++ compilation, particularly for deepspeed, a deep learning optimization library.
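Typically (the toolchain PPA is the standard source for newer GCC releases on Ubuntu 20.04):
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt update
sudo apt install -y gcc-11 g++-11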
Checking to see whether the revised version of CUDA is installed
CUDA in Conda Environments
When you create a Conda environment and install a specific version of CUDA (like 11.8 in your case), you are installing CUDA toolkit libraries that are compatible with that version within that environment.
This installation does not change the system-wide CUDA version, nor does it affect what nvidia-smi displays.
The Conda environment's CUDA version is used by the programs and processes running within that environment. It's independent of the system-wide CUDA installation.
Verifying CUDA Version in Conda Environment
To check the CUDA version in your Conda environment, you should not rely on nvidia-smi. Instead, you can check the version of the CUDA toolkit you have installed in your environment. This can typically be done by checking the version of specific CUDA toolkit packages installed in the environment, like cudatoolkit.
You can use a command like conda list cudatoolkit within your Conda environment to see the installed version of the CUDA toolkit in that environment.
Compatibility
It's important to ensure that the CUDA toolkit version within your Conda environment is compatible with the version supported by your NVIDIA driver (as indicated by nvidia-smi). If the toolkit version in your environment is higher than the driver's supported version, you may encounter compatibility issues.
In summary, nvidia-smi shows the maximum CUDA version supported by your GPU's driver, not the version used in your current Conda environment. To check the CUDA version in a Conda environment, use Conda-specific commands to list the installed packages and their versions.
Another way of putting it:
CUDA Driver Version: The version reported by nvidia-smi is the CUDA driver version installed on your system, which is 12.3 in your case. This is the version of the driver software that allows your operating system to communicate with the NVIDIA GPU.
CUDA Toolkit Version in PyTorch: When you install PyTorch with a specific CUDA toolkit version (like cu118 for CUDA 11.8), it refers to the version of the CUDA toolkit libraries that PyTorch uses for GPU acceleration. PyTorch packages these libraries with itself, so it does not rely on the system-wide CUDA toolkit installation.
Compatibility: The key point is compatibility. Your system's CUDA driver version (12.3) is newer and compatible with the CUDA toolkit version used by PyTorch (11.8). Generally, a newer driver version can support older toolkit versions without issues.
Functionality Check: As long as torch.cuda.is_available() returns True, PyTorch is able to interact with your GPU using its bundled CUDA libraries, and you should be able to run CUDA-accelerated PyTorch operations on your GPUs.
In summary, your setup is fine for running PyTorch with GPU support. The difference in the CUDA driver and toolkit versions is normal and typically not a problem as long as the driver version is equal to or newer than the toolkit version required by PyTorch.
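A quick functional check from the shell, run inside the environment in question:
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"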
Test Compatibility
Below are some scripts you can create to test for compatibility.
These scripts will test that both your CPU and GPU are correctly processing the CUDA code, and that there are no compatibility issues between the installed GCC version and the CUDA Toolkit version you are using.
Compatibility Test Scripts
To test the compatibility of your GCC version with the CUDA Toolkit version installed, you can use a simple CUDA program. Below is a basic script for a CUDA program that performs a simple operation on the GPU. This script will help you verify that your setup is correctly configured for CUDA development.
First, create a simple CUDA program. Let's call it test_cuda.cu:
#include <stdio.h>
#include <cuda_runtime.h>
#include <math.h>  // for fmax/fabs in the host-side error check
// Kernel function to add two vectors
__global__ void add(int n, float *x, float *y) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
int main(void) {
int N = 1<<25; // 33.6M elements
float *x, *y;
cudaEvent_t start, stop;
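// Allocate unified memory, accessible from both host (CPU) and device (GPU)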
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
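// Launch enough blocks to cover all N elements (round up)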
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Time taken: %f ms\n", milliseconds);
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i]-3.0f));
printf("Max error: %f\n", maxError);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(x);
cudaFree(y);
return 0;
}
Next, create a shell script to compile and run this CUDA program. Name this script test_cuda_compatibility.sh:
#!/bin/bash
# Define the CUDA file
cuda_file="test_cuda.cu"
# Define the output executable
output_executable="test_cuda_executable"
# Compile the CUDA program
nvcc $cuda_file -o $output_executable
# Check if the compilation was successful
if [ $? -eq 0 ]; then
echo "Compilation successful. Running the CUDA program..."
./$output_executable
else
echo "Compilation failed."
fi
This script compiles the test_cuda.cu file using nvcc, the NVIDIA CUDA compiler, and then runs the compiled executable if the compilation is successful.
How to Use the Script:
Save the CUDA program code in a file named test_cuda.cu.
Save the shell script in a file named test_cuda_compatibility.sh.
Make the shell script executable:
chmod +x test_cuda_compatibility.sh
Run the script:
./test_cuda_compatibility.sh
If everything is set up correctly, the script will compile the CUDA program and run it, printing the kernel execution time and a maximum error of 0.000000.
If there are compatibility issues between GCC and the CUDA Toolkit, the script will likely fail during compilation, and you'll see error messages indicating what went wrong.
Remember: Compatibility between the GCC version and the CUDA Toolkit is crucial. Make sure the GCC version you choose is compatible with your CUDA Toolkit version.
Where are you now?
We have now created a deep learning development environment optimised for NVIDIA GPUs, with compatibility across key components.
We have so far:
- Installed the CUDA Toolkit and drivers
- Set up the NVIDIA Container Toolkit to allow GPU access from Docker containers
- Ensured host compatibility by verifying that components such as GCC (GNU Compiler Collection) and GLIBC (GNU C Library) are compatible with the CUDA version
- Created a compatibility check script that compiles and runs a small CUDA program to surface compatibility issues
With these components in place, your environment is tailored for deep learning development. It supports the development and execution of deep learning models, leveraging the computational power of GPUs for training and inference tasks.
==> With the environment established for NVIDIA GPUs, the next step is creating the virtual environment for Axolotl and installing the code base.