# Creation of Environment

## <mark style="color:blue;">**Environment Creation**</mark>

The first step in the process is ensuring our machine is optimised to use the NVIDIA platform and GPUs.

## <mark style="color:blue;">Base Distribution: Ubuntu 20.04</mark>

We are going to use <mark style="color:purple;">**Ubuntu 20.04**</mark> as the base image because:

* This version has long-term support (LTS)
* Ubuntu is well supported by NVIDIA
* It works well as a remote instance, so we can access high-powered GPUs
* These remote environments allow all team members to access our platform

## <mark style="color:blue;">Integrated Development Environment (IDE)</mark>&#x20;

We recommend using Visual Studio Code (VS Code) for our <mark style="color:yellow;">Integrated Development Environment</mark> because:

1. <mark style="color:green;">**Support for Remote Development**</mark><mark style="color:green;">:</mark> VS Code offers remote development support, crucial for accessing and managing virtual machines with powerful GPUs (a connection example follows this list).
2. <mark style="color:green;">**Integrated Terminal and Docker Support**</mark><mark style="color:green;">:</mark> The integrated terminal in VS Code enables direct interaction with command-line tools, essential for managing Docker containers and executing model training scripts.&#x20;
3. <mark style="color:green;">**Extensive Language Support**</mark><mark style="color:green;">:</mark> Large language model development often involves multiple programming languages (like Python, C++).  VS Code supports a wide range of languages and their specific tooling, which is critical for such multifaceted development.
4. <mark style="color:green;">**Version Control Integration**</mark><mark style="color:green;">:</mark> With built-in Git support, VS Code makes it easier to track and manage changes in code.
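
For example, remote access can be configured from the command line. The sketch below assumes the VS Code `code` CLI is available locally; the host alias `gpu-vm`, IP address, user, and key path are placeholders for your own VM details:

```bash
# Install the Remote - SSH extension (one-off, on your local machine)
code --install-extension ms-vscode-remote.remote-ssh

# Describe the GPU virtual machine in your SSH config
cat << 'EOF' >> ~/.ssh/config
Host gpu-vm
    HostName 203.0.113.10
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
EOF

# Then, in VS Code: F1 -> "Remote-SSH: Connect to Host..." -> gpu-vm
```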

## <mark style="color:blue;">Virtual Machine Requirements:</mark>

* Docker\*
* CUDA (Version <mark style="color:yellow;">12.1</mark>)\*: a parallel computing platform and programming model
* NVIDIA NGC: NVIDIA's catalogue of GPU-optimised containers, accessed in Docker via the NVIDIA Container Toolkit
* NVIDIA CUDA Toolkit\*: the <mark style="color:yellow;">compiler for CUDA</mark>, which translates CUDA code into executable programs
* GCC: the <mark style="color:blue;">**compiler**</mark> required for development using the CUDA Toolkit
* GLIBC: the GNU Project's implementation of the C standard library. It includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.

#### \*Please note, Continuum's base virtual machine installation script <mark style="color:purple;">installs Docker, the NVIDIA Container Toolkit, and the CUDA 12.1 driver</mark>
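
As a quick sanity check, the commands below each print a version string if the corresponding component is present (the sections that follow walk through each one in detail):

```bash
# Each command prints a version if the component is installed
docker --version
nvcc --version                # NVIDIA CUDA Toolkit compiler
gcc --version | head -n 1
ldd --version | head -n 1     # GLIBC version
nvidia-smi --query-gpu=name,driver_version --format=csv
```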

## <mark style="color:blue;">Check the virtual machine is ready</mark>&#x20;

To ensure that the virtual machine is set up for the training and fine-tuning of large language models, <mark style="color:purple;">follow the instructions below:</mark>

## <mark style="color:blue;">Check the installation of the</mark> <mark style="color:yellow;">NVIDIA CUDA Toolkit</mark>

<mark style="color:green;">What is the CUDA Toolkit?</mark>

The NVIDIA CUDA Toolkit provides a <mark style="color:yellow;">development environment</mark> for creating high performance GPU-accelerated applications.&#x20;

With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated systems.

The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

### We will be installing the NVIDIA CUDA Toolkit, <mark style="color:yellow;">version 12.1</mark>

First, check whether the CUDA Toolkit is installed. We can do this by checking whether its core compiler, the NVIDIA CUDA Compiler (NVCC), is present.

<mark style="color:blue;">**Nvidia CUDA Compiler (NVCC)**</mark> is a part of the CUDA Toolkit.  It is the <mark style="color:yellow;">compiler for CUDA</mark>, responsible for translating CUDA code into executable programs.  &#x20;

NVCC takes high-level CUDA code and turns it into a form that can be understood and executed by the GPU.  It handles the partitioning of code into segments that can be run on either the CPU or GPU, and manages the compilation of the GPU parts of the code.

To check if <mark style="color:yellow;">NVCC is installed</mark> and which version is present, run

```bash
nvcc --version
```

If the NVIDIA CUDA Toolkit has been installed, this will be the output:

<pre class="language-bash"><code class="lang-bash">nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
</code></pre>

You should see <mark style="color:yellow;">release 12.1</mark>, which indicates the <mark style="color:yellow;">CUDA Toolkit Version 12.1</mark> has been successfully installed.

If NVCC is not installed, go ahead and install the CUDA Toolkit 12.1.&#x20;

The CUDA Toolkit download website is located here:

{% embed url="https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local" %}
CUDA 12.1 Download
{% endembed %}

The web application at this site will ask you to define your installation setup.&#x20;

For our base virtual machine this will be:

| Variable         |     Parameter    |
| ---------------- | :--------------: |
| Operating System |       Linux      |
| Architecture     | x86\_64 (64 bit) |
| Distribution     |      Ubuntu      |
| Version          |       20.04      |
| Installer Type   |    deb (local)   |

<figure><img src="/files/3zBPjiIYKHs3PMQlAXa0" alt=""><figcaption><p>This website will provide you the instructions for installing the CUDA Toolkit 12.1</p></figcaption></figure>

The detailed instructions and an explanation of the process are below:

<details>

<summary>Reference: <mark style="color:green;">Installation of CUDA Toolkit 12.1</mark></summary>

### <mark style="color:blue;">Using the local installation approach</mark>

<mark style="color:blue;">**deb (local)**</mark>

* This installer is a Debian package file that contains all the necessary CUDA files.
* It's a large file because it includes everything needed for the installation.

#### <mark style="color:purple;">Step 1: Download the CUDA Repository Pin</mark>

Open your terminal and execute the following command. This will <mark style="color:yellow;">download a pin file</mark> for the CUDA repository:

{% code overflow="wrap" %}

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
```

{% endcode %}

#### <mark style="color:purple;">Step 2: Move the Pin File</mark>

Move the downloaded pin file to `/etc/apt/preferences.d/`. This will prioritize the CUDA repository over others:

```bash
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
```

#### <mark style="color:purple;">Step 3: Download the CUDA Repository Package</mark>

Now, download the Debian package for the CUDA repository:

{% code overflow="wrap" %}

```bash
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2004-12-1-local_12.1.0-530.30.02-1_amd64.deb
```

{% endcode %}

#### <mark style="color:purple;">Step 4: Install the CUDA Repository Package</mark>

Install the downloaded package using `dpkg`:

```bash
sudo dpkg -i cuda-repo-ubuntu2004-12-1-local_12.1.0-530.30.02-1_amd64.deb
```

#### <mark style="color:purple;">Step 5: Copy the GPG Keyring</mark>

Copy the GPG keyring file to the `/usr/share/keyrings/` directory. This step is necessary for the authentication of the CUDA repository:

```bash
sudo cp /var/cuda-repo-ubuntu2004-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
```

#### <mark style="color:purple;">Step 6: Update the Package Lists</mark>

Update the package lists to include packages from the newly added CUDA repository:

```bash
sudo apt-get update
```

#### <mark style="color:purple;">Step 7: Install CUDA</mark>

Finally, install CUDA:

```bash
sudo apt-get -y install cuda
```

### Post-Installation

After the installation is complete, you will need to <mark style="color:yellow;">set up the environment variables for CUDA.</mark>&#x20;

Add the following lines to your `.bashrc` or `.bash_profile` to set `PATH` and `LD_LIBRARY_PATH`:

```bash
export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
```

```bash
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

After editing, apply the changes:

```bash
source ~/.bashrc
```

#### Verify Installation

To verify the installation, you can check the CUDA version:

```bash
nvcc --version
```

</details>

Post-installation, to check that <mark style="color:yellow;">NVCC has been installed successfully</mark> and verify its version, run

```bash
nvcc --version
```

If the NVIDIA CUDA Toolkit has been installed, this will be the output:

<pre class="language-bash"><code class="lang-bash">nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
</code></pre>

{% hint style="warning" %}
If NVCC is pointing to an older version of CUDA despite upgrading to 12.1, you will need to follow the instructions below. CUDA must be on PATH!
{% endhint %}

<details>

<summary><mark style="color:yellow;">Make sure CUDA 12.1 is on PATH</mark></summary>

If you've followed the CUDA installation instructions and still find that an old version of NVCC (NVIDIA CUDA Compiler) is installed, there are several steps you can take to troubleshoot and resolve the issue:

### <mark style="color:blue;">**Verify the Installation**</mark><mark style="color:blue;">:</mark>

Ensure that CUDA was installed correctly without errors. You can check the installation logs for any errors or warnings that might have occurred during the installation process.

Use the following command to <mark style="color:yellow;">list all CUDA-related packages</mark> installed and their versions:

```bash
dpkg -l | grep cuda
```

This will provide a list of all CUDA-related packages. Some examples of the packages this command highlights:

<mark style="color:blue;">**cuda, cuda-12-1**</mark><mark style="color:blue;">:</mark> These are meta-packages for CUDA version 12.1. Installing a meta-package <mark style="color:yellow;">installs all the components of CUDA.</mark>

<mark style="color:blue;">**cuda-cccl-12-1**</mark><mark style="color:blue;">:</mark> CUDA CCCL (CUDA C++ Core Library) is part of the CUDA Toolkit and provides <mark style="color:yellow;">essential libraries for CUDA C++ development.</mark>

<mark style="color:blue;">**cuda-command-line-tools-12-1**</mark><mark style="color:blue;">:</mark> This package includes <mark style="color:yellow;">command-line tools for CUDA,</mark> such as `nvcc` (NVIDIA CUDA Compiler), which is crucial for compiling CUDA code.

<mark style="color:blue;">**cuda-compiler-12-1**</mark><mark style="color:blue;">:</mark> This package includes the <mark style="color:yellow;">CUDA compiler,</mark> which is essential for converting CUDA code into code that can run on Nvidia GPUs.

<mark style="color:blue;">**cuda-demo-suite-12-1**</mark><mark style="color:blue;">:</mark> This package <mark style="color:yellow;">contains demos showcasing the capabilities</mark> and features of CUDA.

<mark style="color:blue;">**cuda-documentation-12-1**</mark><mark style="color:blue;">:</mark> Provides the <mark style="color:yellow;">documentation for CUDA,</mark> useful for developers to understand and use CUDA APIs.

<mark style="color:blue;">**cuda-driver-dev-12-1**</mark><mark style="color:blue;">:</mark> Includes <mark style="color:yellow;">development resources for the CUDA driver,</mark> such as headers and stub libraries.

<mark style="color:blue;">**cuda-libraries-12-1, cuda-libraries-dev-12-1**</mark><mark style="color:blue;">:</mark> These meta-packages include <mark style="color:yellow;">libraries necessary for CUDA development and their development counterparts.</mark>

<mark style="color:blue;">**cuda-nvcc-12-1**</mark><mark style="color:blue;">:</mark> NVIDIA CUDA Compiler (NVCC) is a tool for compiling CUDA code.

<mark style="color:blue;">**cuda-repo-ubuntu2004-12-1-local**</mark><mark style="color:blue;">:</mark> Contains repository configuration files for the CUDA toolkit

### <mark style="color:blue;">**Check Environment Variables**</mark><mark style="color:blue;">:</mark>

* Ensure that your environment variables are pointing to the new CUDA installation. Specifically, check the <mark style="color:yellow;">`PATH`</mark> and <mark style="color:yellow;">`LD_LIBRARY_PATH`</mark> environment variables via the following commands:

```bash
echo $PATH
```

```bash
echo $LD_LIBRARY_PATH
```

Update these variables if they are pointing to an older CUDA version.&#x20;

If CUDA is not on PATH:

### <mark style="color:blue;">**Export PATH**</mark>

```bash
export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
```

This command adds <mark style="color:yellow;">`/usr/local/cuda-12.1/bin`</mark> to the `PATH` environment variable.

<mark style="color:yellow;">`PATH`</mark> is a list of directories the shell searches for executable files. Adding CUDA's bin directory makes it possible to run CUDA tools and compilers directly from the command line without specifying their full path.

`${PATH:+:${PATH}}` is a <mark style="color:yellow;">shell parameter expansion pattern</mark> that appends the existing `PATH` variable to the new path. If <mark style="color:yellow;">`PATH`</mark> is unset, nothing is appended.
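
To see this expansion behaviour in isolation, you can experiment in a throwaway shell session (the `DEMO` variable below is purely illustrative):

```bash
# When the variable is set, a ':' plus the old value is appended
DEMO=/usr/bin
echo "/usr/local/cuda-12.1/bin${DEMO:+:${DEMO}}"
# -> /usr/local/cuda-12.1/bin:/usr/bin

# When the variable is unset, nothing is appended (no dangling ':')
unset DEMO
echo "/usr/local/cuda-12.1/bin${DEMO:+:${DEMO}}"
# -> /usr/local/cuda-12.1/bin
```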

### <mark style="color:blue;">LD\_LIBRARY\_PATH Variable Setup</mark>

**Export LD\_LIBRARY\_PATH for 64-bit OS**:

{% code overflow="wrap" %}

```bash
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

{% endcode %}

<mark style="color:yellow;">`LD_LIBRARY_PATH`</mark> is an environment variable specifying directories where libraries are searched for first, before the standard set of directories.

This command <mark style="color:yellow;">adds the</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`lib64`</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">directory of the CUDA installation to</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`LD_LIBRARY_PATH`</mark>, which is necessary for 64-bit systems. It ensures that the system can find and use the CUDA libraries.

### <mark style="color:blue;">**Verify NVCC Version**</mark><mark style="color:blue;">:</mark>

After updating the environment variables, check the NVCC version again using:

```bash
nvcc --version
```

This should reflect the new version if the environment variables are set correctly.

**Update Alternatives**:

* Sometimes, multiple versions of CUDA can coexist, <mark style="color:yellow;">and the system may still use the old version.</mark> Use `update-alternatives` to configure the default CUDA version.
* Run the following to choose the correct version of NVCC:

```bash
sudo update-alternatives --config nvcc
```
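
If `nvcc` has not yet been registered with the alternatives system, `--config` will have nothing to offer. As a sketch (the link path, CUDA install locations, and priorities below are examples), you can register each installed version first:

```bash
# Register each installed CUDA version as an nvcc alternative
# (the higher priority number wins in automatic mode)
sudo update-alternatives --install /usr/local/bin/nvcc nvcc /usr/local/cuda-12.1/bin/nvcc 50
sudo update-alternatives --install /usr/local/bin/nvcc nvcc /usr/local/cuda-11.8/bin/nvcc 40

# Then choose the default interactively
sudo update-alternatives --config nvcc
```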

</details>
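
Beyond `nvcc --version`, you can verify the toolchain end to end by compiling and running a trivial CUDA program. This is a minimal sketch, assuming `nvcc` is on `PATH` and a CUDA-capable GPU is present; the file name `hello.cu` is arbitrary:

```bash
# Write a minimal CUDA program to a scratch file
cat << 'EOF' > hello.cu
#include <cstdio>

// Each GPU thread prints its global index
__global__ void hello() {
    printf("Hello from GPU thread %d\n", blockIdx.x * blockDim.x + threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // 2 blocks of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel to finish
    return 0;
}
EOF

# Compile and run on the GPU
nvcc hello.cu -o hello
./hello
```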

<details>

<summary>Reference: <mark style="color:green;">What is included in the CUDA Toolkit?</mark></summary>

The CUDA Toolkit contains components and tools designed to facilitate the development, optimisation, and deployment of GPU-accelerated applications.&#x20;

These components cater to various aspects of GPU computing, from low-level programming to high-level library support.&#x20;

Here's an overview of the contents of the CUDA 12.1 toolkit:

1. <mark style="color:blue;">**NVCC (NVIDIA CUDA Compiler)**</mark><mark style="color:blue;">:</mark> NVCC is the primary compiler that converts CUDA code into binary executable form.
2. <mark style="color:blue;">**CUDA Libraries**</mark><mark style="color:blue;">:</mark> CUDA 12.1 includes a range of GPU-accelerated libraries for different types of computing and data processing tasks. These include:
   * <mark style="color:blue;">**cuBLAS**</mark><mark style="color:blue;">:</mark> A GPU-accelerated implementation of the <mark style="color:yellow;">Basic Linear Algebra</mark> Subprograms (BLAS) library.
   * <mark style="color:blue;">**cuFFT**</mark><mark style="color:blue;">:</mark> A library for performing <mark style="color:yellow;">Fast Fourier Transforms (FFT)</mark> on the GPU.
   * <mark style="color:blue;">**cuRAND**</mark><mark style="color:blue;">:</mark> A library for <mark style="color:yellow;">generating random numbers</mark> on the GPU.
   * <mark style="color:blue;">**cuDNN**</mark>: A GPU-accelerated library for <mark style="color:yellow;">deep neural networks.</mark>
   * <mark style="color:blue;">**NVIDIA NPP (NVIDIA Performance Primitives)**</mark><mark style="color:blue;">:</mark> A collection of GPU-accelerated image, video, and signal processing functions.
   * <mark style="color:blue;">**Thrust**</mark><mark style="color:blue;">:</mark> A C++ parallel programming library resembling the C++ Standard Library.
3. <mark style="color:blue;">**CUDA Runtime and Driver APIs**</mark><mark style="color:blue;">:</mark> These APIs allow applications to manage devices, memory, and program executions on GPUs. The runtime API is designed for high-level programming, while the driver API provides lower-level control.
4. <mark style="color:blue;">**CUDA Debugger (cuda-gdb)**</mark><mark style="color:blue;">:</mark> A tool to debug CUDA applications running on the GPU.
5. <mark style="color:blue;">**CUDA Profiler (nvprof and Nsight Systems)**</mark><mark style="color:blue;">:</mark> Tools for profiling the performance of CUDA applications. These profilers help in identifying performance bottlenecks.
6. <mark style="color:blue;">**Nsight Eclipse Edition**</mark><mark style="color:blue;">:</mark> An integrated development environment (IDE) for developing and debugging CUDA applications on Linux and macOS.
7. <mark style="color:blue;">**CUDA Samples and Documentation**</mark><mark style="color:blue;">:</mark> A set of sample code and comprehensive documentation to help developers understand and use various features of the CUDA toolkit.
8. <mark style="color:blue;">**CUDA Toolkit SDK**</mark><mark style="color:blue;">:</mark> Contains additional libraries, code samples, and resources for software development.
9. <mark style="color:blue;">**Support for Various GPU Architectures**</mark><mark style="color:blue;">:</mark> Each CUDA version supports specific Nvidia GPU architectures. CUDA 12.1 would include support for the latest architectures available at its release.
10. <mark style="color:blue;">**Multi-Language Support**</mark>: While CUDA is primarily used with C and C++, it also supports other languages like Python (via libraries like PyCUDA).
11. <mark style="color:blue;">**GPU-accelerated Machine Learning and AI Libraries**</mark><mark style="color:blue;">:</mark> Integration with libraries and frameworks for AI and machine learning, such as TensorFlow and PyTorch, which can leverage CUDA for GPU acceleration.

</details>

## <mark style="color:green;">CUDA</mark> <mark style="color:yellow;">Reference Materials</mark>

Please find the NVIDIA CUDA Installation Guide for Linux below:

{% embed url="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" %}
Reference Materials for CUDA Installation for Linux distributions
{% endembed %}

<details>

<summary><mark style="color:green;">NVIDIA CUDA Installation Guide for Linux</mark></summary>

The NVIDIA CUDA Installation Guide for Linux outlines the process for installing the CUDA Toolkit on a Linux system.&#x20;

1. **System Requirements**:
   * Lists the prerequisites for using NVIDIA CUDA, including a CUDA-capable GPU, a supported version of Linux with gcc compiler and toolchain, and the CUDA Toolkit.
   * Provides information about compatible Linux distributions and kernel versions, including links to resources for specific Linux distributions like RHEL, SLES, Ubuntu LTS, and L4T.
2. **OS Support Policy**:
   * Details the support policy for various Linux distributions like Ubuntu, RHEL, CentOS, Rocky Linux, SUSE SLES, OpenSUSE Leap, Debian, Fedora, and KylinOS, including their end-of-support life (EOSS) timelines.
3. **Host Compiler Support Policy**:
   * Describes the supported compilers for different architectures (x86\_64, Arm64 sbsa, POWER 9), including GCC, Clang, NVHPC, XLC, ArmC/C++, and ICC.
   * Emphasizes the importance of using a compatible host compiler for compiling CPU host code in CUDA sources.
4. **Pre-installation Actions**:
   * Outlines steps to prepare for CUDA Toolkit and Driver installation, including verifying the presence of a CUDA-capable GPU, a supported version of Linux, gcc installation, correct kernel headers and development packages, and considering conflicting installation methods.
   * Notes the possibility of overriding install-time prerequisite checks using the `-override` flag.
5. **Verification Steps**:
   * Provides commands to verify the presence of a CUDA-capable GPU (`lspci | grep -i nvidia`), the supported version of Linux (`uname -m && cat /etc/*release`), and the gcc installation (`gcc --version`).
   * Explains how to check for the correct kernel headers and development packages.
6. **Ubuntu-Specific Installation Instructions**:
   * Details steps to prepare Ubuntu for CUDA installation, including installing kernel headers and removing outdated signing keys.
   * Describes two installation methods for Ubuntu: local repo and network repo, including steps for each method.
   * For both methods, provides common instructions to update the Apt repository cache and install the CUDA SDK, and includes a reminder to reboot the system after installation.

</details>

If you are interested, you can familiarise yourself with <mark style="color:green;">**CUDA Best Practices:**</mark>

{% embed url="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades" %}
CUDA Best Practices Guide
{% endembed %}

### <mark style="color:purple;">Below is a summary of CUDA Programming</mark>

<details>

<summary><mark style="color:green;">CUDA Tutorial - CUDA Programming is difficult!</mark></summary>

<mark style="color:blue;">**CUDA's Role in Machine Learning**</mark>

CUDA, NVIDIA's parallel computing platform, is particularly important in machine learning because many <mark style="color:yellow;">machine learning tasks involve linear algebra, like matrix multiplications and vector operations.</mark>  CUDA is optimised for these kinds of operations, offering significant performance improvements over traditional CPU processing.

<mark style="color:blue;">**Installation and Setup Requirements**</mark>

To use CUDA, you need to install the CUDA toolkit and have an <mark style="color:blue;">NVIDIA GPU</mark>, as CUDA does not work with other types of GPUs like AMD. The setup process varies depending on your operating system (Linux, Windows, or potentially Mac).

<mark style="color:blue;">**Prerequisite Knowledge**</mark>

A basic understanding of C or C++ programming is necessary to work with CUDA. Concepts like memory allocation (`malloc`) and freeing memory are used without detailed explanation. If you're only familiar with Python, you might find following the CUDA examples challenging.

<mark style="color:blue;">**CUDA Programming Basics**</mark>

CUDA code is similar to C/C++ but includes additional functions and data types for parallel computing.  Understanding the transition from C to CUDA is critical for grasping how parallelisation is implemented in CUDA.

<mark style="color:blue;">**Understand CUDA's Grid and Block Model**</mark>

The explanation of CUDA's grid and block model is crucial. It's important to understand <mark style="color:yellow;">how to configure blocks and threads within a grid to effectively parallelize tasks.</mark> When defining grid and block dimensions, remember that they directly influence the number of threads that will execute your code and how they are organized. Incorrect configurations can lead to inefficient use of GPU resources or even cause your program to behave unexpectedly.

<mark style="color:blue;">**Memory Management in CUDA**</mark>

In CUDA, memory allocation and management are crucial, especially since you're dealing with both host (CPU) and device (GPU) memory. Incorrect memory handling can lead to crashes or incorrect program results.

<mark style="color:blue;">**Mapping Problem Domain to CUDA's Architecture**</mark>

The idea of <mark style="color:yellow;">mapping a matrix's shape to CUDA's grid shape illustrates a key concept in CUDA programming</mark>: effectively mapping your problem domain to CUDA's architecture. This can be challenging, as it requires a good <mark style="color:yellow;">understanding of both your application's requirements and CUDA's parallel execution model.</mark>

<mark style="color:blue;">**Performance Considerations**</mark>

One of the objectives of using CUDA is to enhance performance. It's important to note that not all problems will see a dramatic performance increase with CUDA, and sometimes the overhead of managing GPU resources can outweigh the benefits for smaller or less complex tasks.

<mark style="color:blue;">**Initialisation and Memory Representation**</mark>

It's important to understand how multi-dimensional data structures like matrices are represented in memory, especially since CUDA typically deals with flattened arrays. This understanding is crucial for correctly indexing elements during calculations.

<mark style="color:blue;">**Understanding the Mapping of Computation to CUDA Threads**</mark>

Each thread in CUDA is assigned a specific part of the computation, like a segment of a matrix-vector multiplication. This mapping is critical for efficient parallel computation. It's important to correctly calculate the indices each thread will work on, taking into account both block and thread indices.

<mark style="color:blue;">**Boundary Conditions in CUDA Computations**</mark>

When working with CUDA, it’s important to <mark style="color:yellow;">handle boundary conditions carefully.</mark>  If the dimensions of your data do not exactly match the grid and block dimensions you've configured, you must ensure that your code correctly handles these edge cases to avoid out-of-bounds memory access, which can lead to incorrect results or crashes.

<mark style="color:blue;">**Host-Device Memory Management**</mark>

The tutorial <mark style="color:yellow;">highlights the complexity of managing memory between the host (CPU) and the device (GPU)</mark>.  Data must be explicitly allocated and transferred between the host and device memories. This process adds an extra layer of complexity to CUDA programming and requires careful handling to ensure data integrity and to avoid memory leaks.

<mark style="color:blue;">**Grid and Block Dimension Calculations**</mark>

Calculating the dimensions of grids and blocks is a critical aspect of CUDA programming.  The dimensions influence how many threads are launched and how they are organised. Misconfigurations here can lead to inefficient use of GPU resources or even failure to execute the program correctly.
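
To make these ideas concrete, below is a small self-contained sketch (written as a shell session, with all file and variable names illustrative) that ties together grid/block dimension calculations, flattened 2-D indexing, boundary checks, and explicit host-device memory management in a matrix-vector multiplication:

```bash
cat << 'EOF' > matvec.cu
#include <cstdio>
#include <cstdlib>

// y = A * x for a flattened rows-by-cols matrix A.
// One thread computes one output element; the boundary check guards
// rows that do not divide evenly into the grid.
__global__ void matvec(const float *A, const float *x, float *y,
                       int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {                            // boundary condition
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j)
            sum += A[row * cols + j] * x[j];     // flattened 2-D indexing
        y[row] = sum;
    }
}

int main() {
    const int rows = 1000, cols = 512;
    size_t aBytes = rows * cols * sizeof(float);
    size_t xBytes = cols * sizeof(float), yBytes = rows * sizeof(float);

    // Host (CPU) allocations
    float *hA = (float *)malloc(aBytes);
    float *hx = (float *)malloc(xBytes);
    float *hy = (float *)malloc(yBytes);
    for (int i = 0; i < rows * cols; ++i) hA[i] = 1.0f;
    for (int j = 0; j < cols; ++j) hx[j] = 1.0f;

    // Device (GPU) allocations and host-to-device copies
    float *dA, *dx, *dy;
    cudaMalloc(&dA, aBytes); cudaMalloc(&dx, xBytes); cudaMalloc(&dy, yBytes);
    cudaMemcpy(dA, hA, aBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, xBytes, cudaMemcpyHostToDevice);

    // Round the grid up so every row gets a thread
    int threads = 256;
    int blocks = (rows + threads - 1) / threads;
    matvec<<<blocks, threads>>>(dA, dx, dy, rows, cols);

    // Copy the result back (implicitly synchronises with the kernel)
    cudaMemcpy(hy, dy, yBytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected %d)\n", hy[0], cols);

    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    free(hA); free(hx); free(hy);
    return 0;
}
EOF

nvcc matvec.cu -o matvec && ./matvec
```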

</details>

<details>

<summary><mark style="color:green;">CUDA - Version Control</mark></summary>

Managing CUDA versions and resolving conflicts can be critical.

### <mark style="color:purple;">Understanding CUDA and its Interaction with Libraries</mark>

<mark style="color:green;">**Compatibility:**</mark>

* <mark style="color:blue;">**CUDA and GPU Compatibility**</mark><mark style="color:blue;">:</mark> Each GPU model supports certain CUDA versions. Ensure your GPU is compatible with the CUDA version you plan to use.
* <mark style="color:blue;">**Library Dependencies**</mark><mark style="color:blue;">:</mark> Machine learning libraries like TensorFlow, PyTorch, etc., are *<mark style="color:yellow;">built against specific CUDA and cuDNN versions</mark>*<mark style="color:yellow;">.</mark> Using incompatible versions can lead to errors.

<mark style="color:green;">**CUDA Toolkit vs. CUDA Runtime**</mark><mark style="color:green;">:</mark>

* <mark style="color:blue;">**CUDA Toolkit**</mark><mark style="color:blue;">:</mark> Includes the CUDA runtime and development environment (compiler, debugger, etc.). Needed for compiling CUDA-enabled applications.
* <mark style="color:blue;">**CUDA Runtime**</mark><mark style="color:blue;">:</mark> Required to run CUDA-enabled applications. Often, libraries like PyTorch bundle the necessary CUDA runtime components, so you might not need a system-wide CUDA installation.

### <mark style="color:purple;">Continuum - managing CUDA Versions and Conflicts</mark>

1. **Isolate Environments**:
   * Use separate Python environments using Anaconda for different projects. This isolation allows you to install different versions of libraries (and their corresponding CUDA versions) without conflict.
   * Example: Have one Conda environment for TensorFlow with CUDA 11.0 and another for PyTorch with CUDA 10.2.
2. **Containerization**:
   * Use Docker or similar containerisation tools. They encapsulate the entire runtime environment, including the specific versions of CUDA, cuDNN, and other dependencies.
   * Nvidia provides Docker images (NGC Containers) with TensorFlow, PyTorch, etc., pre-installed with the required CUDA and cuDNN versions.
3. **Local CUDA Installation**:
   * If you need to compile CUDA code, install the CUDA Toolkit locally.
   * Be mindful of the CUDA version if you have multiple projects requiring different versions.
4. **Check Compatibility Before Installation**:
   * Before installing a deep learning library, check its compatibility with your CUDA version.
   * Use the installation commands tailored to specific CUDA versions (as seen with PyTorch's version-specific install commands, sketched below).
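
As a sketch of the first two points (environment name, Python version, and package index below are illustrative; check each library's install guide for current commands):

```bash
# Create an isolated Conda environment for a PyTorch + CUDA 12.1 project
conda create -n torch-cu121 python=3.10 -y
conda activate torch-cu121

# Install a PyTorch build compiled against CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
```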

</details>

## <mark style="color:blue;">Check the installation of the</mark> <mark style="color:yellow;">NVIDIA Container Toolkit</mark>

The NVIDIA Container Toolkit enables users to build and run GPU-accelerated containers.&#x20;

The toolkit includes a container runtime [library](https://github.com/NVIDIA/libnvidia-container) and utilities to automatically configure containers to leverage NVIDIA GPUs.

This allows you to use NVIDIA containers in Docker.

### <mark style="color:blue;">Background and Explanation</mark>

<figure><img src="/files/HYzH9n1A3rFHqZn7HnR3" alt=""><figcaption></figcaption></figure>

The <mark style="color:blue;">**NVIDIA Container Toolkit**</mark> is designed to integrate NVIDIA GPUs into containerised applications. It's compatible with various container runtimes and consists of several components:

1. <mark style="color:blue;">**NVIDIA Container Runtime (nvidia-container-runtime):**</mark> An OCI-compliant runtime for Docker or containerd, enabling the use of NVIDIA GPUs in containers.
2. <mark style="color:blue;">**NVIDIA Container Runtime Hook (nvidia-container-toolkit / nvidia-container-runtime-hook):**</mark> A component executing prestart scripts to configure GPU access in containers.
3. <mark style="color:blue;">**NVIDIA Container Library and CLI (libnvidia-container1, nvidia-container-cli):**</mark> These provide a library and CLI for automatically configuring containers with NVIDIA GPU support, independent of the container runtime.

The toolkit's architecture allows for integration with various container runtimes like Docker, containerd, cri-o, and lxc. Notably, the NVIDIA Container Runtime is not required for cri-o and lxc.

The toolkit comprises four main packages: nvidia-container-toolkit, nvidia-container-toolkit-base, libnvidia-container-tools, and libnvidia-container1, with specific dependencies between them.&#x20;

Older packages like nvidia-docker2 and nvidia-container-runtime are now deprecated and merged into the nvidia-container-toolkit.

### <mark style="color:blue;">The Architecture</mark>

<figure><img src="/files/GDZItgUO2FUDTjxsyngI" alt=""><figcaption></figcaption></figure>

### <mark style="color:blue;">Key functionalities</mark>

* <mark style="color:blue;">**NVIDIA Container Runtime Hook:**</mark> Implements a runC prestart hook, configuring GPU devices in containers based on the container's `config.json`.
* <mark style="color:blue;">**NVIDIA Container Runtime:**</mark> A wrapper around runC, modifying the OCI runtime spec for GPU support.
* <mark style="color:blue;">**NVIDIA Container Toolkit CLI:**</mark> Offers utilities for configuring runtimes and generating Container Device Interface (CDI) specifications.

For installation, the <mark style="color:yellow;">`nvidia-container-toolkit`</mark> package is generally sufficient.&#x20;

The toolkit's packages are available on GitHub, useful for both online and air-gapped installations. The repository also hosts experimental releases of the software.

### <mark style="color:blue;">Check for Docker</mark>

Before installing, ensure you have Docker installed:

```bash
docker --version
```

The output should be similar to the below:

```bash
Docker version 26.0.2, build 3c863ff
```

To install the NVIDIA Container Toolkit, follow the instructions below:

<details>

<summary><mark style="color:green;">Installation of the NVIDIA Container Toolkit</mark></summary>

Configure the repository

{% code overflow="wrap" fullWidth="true" %}

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

{% endcode %}

Update the packages list from the repository:

```bash
sudo apt-get update
```

Install the NVIDIA Container Toolkit packages:

```bash
sudo apt-get install -y nvidia-container-toolkit
```

#### Configuring Docker

Configure the container runtime by using the `nvidia-ctk` command:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
```

The `nvidia-ctk` command modifies the `/etc/docker/daemon.json` file on the host. The file is updated so that Docker can use the NVIDIA Container Runtime.
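
You can inspect the file to confirm the change; a typical entry (exact contents may vary by toolkit version) looks like the commented sketch below:

```bash
cat /etc/docker/daemon.json
# {
#     "runtimes": {
#         "nvidia": {
#             "args": [],
#             "path": "nvidia-container-runtime"
#         }
#     }
# }
```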

Restart the Docker daemon:

```bash
sudo systemctl restart docker
```

</details>

Post installation, you can check to see if the Container Toolkit is working:

```bash
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

The output should look like this:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:00:05.0 Off |                    0 |
| N/A   35C    P0               55W / 400W|      5MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```

### <mark style="color:blue;">Compatibility Testing</mark>

The CUDA development environment <mark style="color:yellow;">relies on tight integration with the host development environment</mark>, ***including the host compiler and C runtime libraries***, and is therefore only supported on Ubuntu versions that have been qualified for the CUDA Toolkit release.

Now that we have installed the NVIDIA CUDA Toolkit and the NVIDIA Container Toolkit, we need to ensure our virtual machine is compatible with these installations.

{% hint style="danger" %} <mark style="color:orange;">Compatibility is critical</mark>
{% endhint %}

The material below provides instructions on how to <mark style="color:yellow;">ensure the NVIDIA drivers and the CUDA Toolkit are compatible with the host system.</mark>

### <mark style="color:blue;">Compatibility between CUDA 12.1 and the host development environment</mark>

This table lists the kernel versions, default <mark style="color:purple;">**GCC (GNU Compiler Collection)**</mark> versions, and <mark style="color:purple;">**GLIBC (GNU C Library) versions**</mark> for two different LTS (Long-Term Support) releases of Ubuntu.

<table data-full-width="false"><thead><tr><th>Distribution</th><th align="center">Kernel</th><th align="center">Default GCC</th><th align="center">GLIBC</th></tr></thead><tbody><tr><td>Ubuntu 22.04 LTS</td><td align="center">5.15.0-43</td><td align="center">11.2.0</td><td align="center">2.35</td></tr><tr><td><mark style="color:green;">Ubuntu 20.04 LTS</mark></td><td align="center"><mark style="color:red;">5.13.0-46</mark></td><td align="center"><mark style="color:red;">9.3.0</mark></td><td align="center"><mark style="color:red;">2.31</mark></td></tr></tbody></table>

### <mark style="color:blue;">Check the Kernel compatibility</mark>

To check the <mark style="color:blue;">**kernel version**</mark> of your Ubuntu 20.04 system, you can use the <mark style="color:yellow;">`uname`</mark> command in the terminal. The <mark style="color:yellow;">`uname`</mark> command with different options provides various system information, including the kernel version. Here's how you can do it:

**Run the <mark style="color:yellow;">`uname`</mark> command** to get the kernel version by typing the following command and pressing Enter:

```bash
uname -r
```

The output should be this on a typical <mark style="color:yellow;">Ubuntu WSL2 distribution:</mark>

```bash
5.15.133.1-microsoft-standard-WSL2
```

or this on a <mark style="color:yellow;">typical Ubuntu 20.04</mark> virtual machine

```bash
5.4.0-167-generic
```

As you can see, the first <mark style="color:blue;">Linux kernel is 5.15.133.1</mark>, which is newer than the 5.13.0-46 kernel listed for Ubuntu 20.04 in the table above, and works with the installed CUDA Toolkit.&#x20;

The second Linux kernel, 5.4.0, is the stock kernel shipped with Ubuntu 20.04 LTS and is likewise compatible.
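
The NVIDIA installation guide also recommends ensuring that the kernel headers and development packages for the running kernel are installed before installing CUDA. On Ubuntu this is:

```bash
# Install headers matching the currently running kernel
sudo apt-get install linux-headers-$(uname -r)
```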

<details>

<summary><mark style="color:green;">What is a kernel?</mark></summary>

<mark style="color:yellow;">A kernel is the core component of an operating system (OS)</mark>.&#x20;

It acts as a bridge between applications and the actual data processing done at the hardware level<mark style="color:yellow;">.</mark>&#x20;

The kernel's responsibilities include managing the system's resources and allowing multiple programs to run and use these resources efficiently. Here are some key aspects of a kernel:

<mark style="color:blue;">**Resource Management**</mark>

The kernel <mark style="color:yellow;">manages hardware resources like the CPU, memory, and disk space</mark>. It allocates resources to various processes, ensuring that each process receives enough resources to function effectively while maintaining overall system efficiency.

<mark style="color:blue;">**Process Management**</mark>

It handles the creation, scheduling, and termination of processes. The kernel decides which processes should run when and for how long, a process known as <mark style="color:yellow;">scheduling.</mark> This is critical in multi-tasking environments where multiple processes require CPU attention.

<mark style="color:blue;">**Memory Management**</mark>

The kernel controls how <mark style="color:yellow;">memory is allocated to various processes and manages memory access</mark>, ensuring that each process has access to the memory it needs without interfering with other processes. It also manages virtual memory, allowing the system to use disk space as an extension of RAM.

<mark style="color:blue;">**Device Management**</mark>

It acts as an <mark style="color:yellow;">intermediary between the hardware and software of a computer.</mark>  For instance, when a program needs to read a file from a disk, it requests this service from the kernel, which then communicates with the disk drive’s hardware to read the data.

<mark style="color:blue;">**Security and Access Control**</mark>

The kernel enforces access control policies, preventing unauthorised access to the system and its resources. It manages user permissions and ensures that processes have the required privileges to execute their tasks.

<mark style="color:blue;">**System Calls**</mark>

These are the mechanisms through which user-space applications interact with the kernel.  For example, when an application needs to open a file, it makes a system call, which is handled by the kernel.

<mark style="color:blue;">**Types of Kernels**</mark>

* **Monolithic Kernels**: These kernels include various services like the filesystem, device drivers, network interfaces, etc., within one large kernel. Example: Linux.
* **Microkernels**: These kernels focus on minimal functionality, providing only basic services like process and memory management. Other components like device drivers are run in user space. Example: Minix.
* **Hybrid Kernels**: These are a mix of monolithic and microkernel architectures. Example: Windows NT kernel.

<mark style="color:blue;">Examples of Kernels</mark>

* **Linux Kernel**: Used in Linux distributions.
* **Windows NT Kernel**: Used in various versions of Microsoft Windows.
* **XNU Kernel**: Used in macOS and iOS.

</details>

### <mark style="color:blue;">Check GNU Compiler Compatibility</mark>

NVIDIA CUDA Libraries work in conjunction with <mark style="color:blue;">**GCC (GNU Compiler Collection)**</mark> on Linux systems.&#x20;

GCC is commonly used for compiling the <mark style="color:blue;">**host (CPU)**</mark> part of the code, while CUDA tools like nvcc (NVIDIA CUDA Compiler) are used for compiling the <mark style="color:blue;">**device (GPU)**</mark> part of the code.

The CUDA Toolkit includes wrappers and libraries that *<mark style="color:yellow;">**facilitate the integration between the CPU and GPU parts of the code.**</mark>* &#x20;

NVIDIA provides compatibility information for specific versions of GCC, especially on Linux systems where GCC is a common choice for compiling the host code.

The CUDA runtime libraries, which are installed separately, are sufficient for running CUDA applications on systems with compatible NVIDIA GPUs.

{% hint style="warning" %}
The <mark style="color:blue;">**gcc compiler**</mark> is required for development using the CUDA Toolkit
{% endhint %}

To reiterate - when developing applications that use both CPU and GPU, <mark style="color:yellow;">developers might use GCC for compiling the CPU part of the code</mark>, while <mark style="color:yellow;">CUDA tools (like nvcc - NVIDIA CUDA Compiler) are used for compiling the GPU part</mark>.

The CUDA toolkit often includes compatibility information with specific versions of GCC, especially on Linux systems, where GCC is a common choice for compiling the host code.
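
If several GCC versions are installed, you can point `nvcc` at a specific host compiler with its `-ccbin` option (the source file name and `gcc-9` below are examples):

```bash
# Use gcc-9 for the host (CPU) side of the compilation
nvcc -ccbin gcc-9 hello.cu -o hello
```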

Run the following command to check the <mark style="color:blue;">installed version of GCC:</mark>

```bash
gcc --version
```

The first line of the output will show the version number. Ensure it matches the default GCC version listed in the table above for your Ubuntu version.

<pre class="language-bash"><code class="lang-bash">gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0  &#x3C;-- This is the GCC version number
</code></pre>

If you <mark style="color:yellow;">**do not**</mark> have GCC installed, execute the following:

<details>

<summary><mark style="color:green;">Installation of GCC via installing 'build essentials'</mark></summary>

The <mark style="color:yellow;">`build-essential`</mark> meta-package in Ubuntu is a collection of tools and packages needed for <mark style="color:yellow;">compiling and building software.</mark>&#x20;

This package is particularly useful for developers and those compiling software from source. Here's a detailed summary of each package included in `build-essential` and information about where these packages are typically stored on Ubuntu 20.04:

### <mark style="color:purple;">**dpkg-dev**</mark>

* **Purpose**: This package is a collection of <mark style="color:yellow;">development tools required to handle Debian (.deb) packages.</mark> It includes utilities to unpack, build, and upload Debian source packages, making it an essential tool for packaging software for Debian-based systems like Ubuntu.
* **Storage Location**: The tools and scripts from <mark style="color:yellow;">`dpkg-dev`</mark> are usually stored in <mark style="color:yellow;">`/usr/bin/`</mark> and <mark style="color:yellow;">`/usr/share/dpkg/`</mark><mark style="color:yellow;">.</mark>

### <mark style="color:purple;">**make**</mark>

* **Purpose**: `make` is a build automation tool that automatically builds executable programs and libraries from source code by reading files called Makefiles.  It's crucial for compiling large programs where it manages dependencies and only recompiles parts of the program that have changed.
* **Storage Location**: The `make` executable is typically found in <mark style="color:yellow;">`/usr/bin/make`</mark><mark style="color:yellow;">.</mark>

### <mark style="color:purple;">**libc6-dev**</mark>

* **Purpose**: This package contains the development libraries and header files for the GNU C Library. It's essential for <mark style="color:yellow;">compiling C and C++ programs</mark>, as it includes standard libraries and headers.
* **Storage Location**: The headers and libraries are generally located in `/usr/include/` and `/usr/lib/` respectively.

### <mark style="color:purple;">**gcc/g++**</mark>

* **Purpose**: These are the GNU Compiler Collection for C and C++ languages. <mark style="color:yellow;">`gcc`</mark> is for compiling C programs, while <mark style="color:yellow;">`g++`</mark> is used for C++ programs. They are fundamental for software development in these languages.
* **Storage Location**: The compilers are usually found in <mark style="color:yellow;">`/usr/bin/`</mark><mark style="color:yellow;">.</mark>

When you install the <mark style="color:yellow;">`build-essential`</mark> package on Ubuntu, it automatically installs these components and their dependencies. This package streamlines the setup process for a development environment by bundling these critical tools together.

To install <mark style="color:yellow;">`build-essential`</mark> on Ubuntu 20.04, you can use the following command in the terminal:

```bash
sudo apt update
sudo apt install build-essential
```

This command will download and install the <mark style="color:yellow;">`build-essential`</mark> package along with its dependencies. The packages are typically stored in the locations mentioned above, following the standard file system hierarchy of Linux systems. This structure helps in maintaining a standardized path for binaries, libraries, and other files, making it easier for users and other software to locate them.

</details>

Post-installation of `build-essential`, check the GCC version:

```bash
gcc --version
```

The output should <mark style="color:blue;">now confirm you have GCC installed:</mark>

<pre class="language-bash"><code class="lang-bash">gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
</code></pre>

This version 9.4 should work with CUDA 12.1, which requires at least GCC 9.3. GCC minor releases are backward compatible, so version 9.4 is fine.&#x20;

<details>

<summary><mark style="color:green;">What is GCC and why is the version important?</mark></summary>

GCC is a <mark style="color:purple;">**collection of compilers for various programming languages**</mark>**.**&#x20;

Although it started primarily for C (hence the original name GNU C Compiler), it now supports C++, Objective-C, Fortran, Ada, Go, and D.

<mark style="color:blue;">**Cross-Platform Compatibility**</mark>

GCC can be used on many different types of operating systems and hardware architectures. This cross-platform capability makes it a versatile tool for developers who work in diverse environments.

<mark style="color:blue;">**Optimization and Portability**</mark>

GCC offers a wide range of options for code optimization, making it possible to <mark style="color:purple;">tune performance for specific hardware or application requirements</mark>. It also emphasizes portability, enabling developers to compile their code on one machine and run it on another without modification.

<mark style="color:blue;">**Standard Compliance**</mark>

GCC strives to adhere closely to various programming language standards, including those for C and C++. This compliance ensures that code written and compiled with GCC is compatible with other compilers following the same standards.

<mark style="color:blue;">**Debugging and Error Reporting**</mark>

GCC is known for its helpful debugging features and detailed error reporting, which are invaluable for developers in identifying and fixing code issues.

<mark style="color:blue;">**Integration with Development Tools**</mark>

GCC easily integrates with various development tools and environments. It's commonly used in combination with IDEs, debuggers, and other tools, forming a complete development ecosystem.

</details>

### <mark style="color:blue;">Check GLIBC Compatibility</mark>

The GNU C Library, commonly known as glibc, is an important component of GNU systems and Linux distributions.&#x20;

GLIBC is the GNU Project's implementation of the C standard library.   It provides the system's core libraries. This includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.

To check the <mark style="color:blue;">GLIBC version:</mark>

```bash
ldd --version
```

The first line of the output will show the version number. For example:

```bash
ldd (Ubuntu GLIBC 2.31-0ubuntu9.12) 2.31  <-- This is the version
```

Compare this with the GLIBC version in the table above.&#x20;

The <mark style="color:yellow;">GLIBC version of 2.31</mark> matches the version required for the NVIDIA CUDA Toolkit.

<details>

<summary><mark style="color:green;">What is GLIBC?</mark></summary>

1. **Definition**: GLIBC is the <mark style="color:yellow;">GNU Project's implementation of the C standard library</mark>. Despite its name, it now also directly supports C++ (and indirectly other programming languages).
2. **Purpose**: It provides the system's core libraries. This includes facilities for basic file I/O, string manipulation, mathematical functions, and various other standard utilities.
3. **Compatibility**: It's designed to be compatible with the POSIX standard, the Single UNIX Specification, and several other open standards, while also extending them in various ways.
4. **System Calls and Kernel**: glibc serves as a wrapper for system calls to the Linux kernel and other essential functions. This means that most applications on a Linux system depend on glibc to interact with the underlying kernel.
5. **Portability**: It's used in systems that range from embedded systems to servers and supercomputers, providing a consistent and reliable base across various hardware architectures.

<mark style="color:blue;">**Checking GLIBC Version**</mark>

To check the version of glibc on a Linux system, you can use the <mark style="color:yellow;">`ldd`</mark> command, which prints the shared library dependencies. The version of glibc will be displayed as part of this output. Here's how to do it:

**Run the Command**: Type the following command and press Enter:

```bash
ldd --version
```

The first line of the output will typically show the glibc version. For example, it might say `ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31`, where "2.31" is the version of glibc.

#### <mark style="color:blue;">Importance in Development</mark>

1. **Compatibility**: When developing software for Linux, it's crucial to know the version of glibc your application will be running against, as different versions may have different features and behaviors.
2. **Portability**: For applications intended to run on multiple Linux distributions, understanding glibc compatibility is key to ensuring broad compatibility.
3. **System-Level Programming**: For low-level system programming, knowledge of glibc is essential as it provides the interface to many kernel-level services and system resources.
4. **Debugging**: Understanding glibc can be crucial for debugging, especially for complex applications that perform a lot of system-level operations.

</details>

#### With the NVIDIA CUDA Toolkit's host compatibility confirmed, the next step is to check that all installations have been successful

### <mark style="color:blue;">Process for checking installations have been successful</mark>

First, <mark style="color:yellow;">check your Ubuntu version</mark>. Ensure it matches <mark style="color:blue;">Ubuntu 20.04,</mark> our designated Linux operating system:

```bash
lsb_release -a
```

Then, verify that your system is based on the x86\_64 architecture. Run:

```bash
uname -m
```

The output should be:

```
x86_64
```

To check if your system has a <mark style="color:blue;">**CUDA-capable NVIDIA GPU**</mark>, run

```bash
nvidia-smi
```

You should see an output like this, which details the <mark style="color:blue;">NVIDIA Drivers installed</mark> and the <mark style="color:blue;">CUDA Version</mark>.

<pre><code>+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0               56W / 400W|      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1314      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
</code></pre>

If this output is not visible, we must <mark style="color:blue;">**install the NVIDIA Drivers**</mark>.
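
As a sketch of one way to install the drivers on Ubuntu (the `530` driver series matches the 530.30.02 driver shown above for CUDA 12.1, but confirm the recommended package for your GPU, for example with `ubuntu-drivers devices`):

```bash
sudo apt-get update
sudo apt-get install -y nvidia-driver-530
sudo reboot
```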

### <mark style="color:blue;">A full analysis</mark>

To run all of these checks at once and get a full printout of your system features, enter this command into the terminal:

```bash
echo "Machine Architecture: $(uname -m)" && \
echo "Kernel Name: $(uname -s)" && \
echo "Kernel Release: $(uname -r)" && \
echo "Kernel Version: $(uname -v)" && \
echo "Hostname: $(uname -n)" && \
echo "Operating System: $(uname -o)" && \
echo "----" && \
cat /proc/version && \
echo "----" && \
echo "CPU Information:" && cat /proc/cpuinfo | grep 'model name' | uniq && \
echo "----" && \
echo "Memory Information:" && cat /proc/meminfo | grep 'MemTotal' && \
echo "----" && \
lsb_release -a 2>/dev/null && \
echo "----" && \
echo "NVCC Version:" && nvcc --version
```

The output from the terminal will provide all the information you need to check your system for compatibility.

<details>

<summary><mark style="color:green;">Typical analysis of the output from an A100 80GB instance</mark></summary>

<mark style="color:blue;">**Machine Architecture: x86\_64**</mark>

* Your system uses the 64-bit version of the x86 architecture. This is a standard architecture for modern desktops and servers, supporting more memory and larger data sizes compared to 32-bit systems.

<mark style="color:blue;">**Kernel Details**</mark>

* **Kernel Name**: Linux, indicating that your operating system is based on the Linux kernel.
* **Kernel Release**: 5.4.0-167-generic. This specifies the version of the Linux kernel you are running. 'Generic' here implies a standard kernel version that is versatile for various hardware setups.
* **Kernel Version**: #184-Ubuntu SMP. This shows a specific build of the kernel, compiled with Symmetric Multi-Processing (SMP) support, allowing efficient use of multi-core processors. The timestamp shows the build date.

<mark style="color:blue;">**Hostname**</mark><mark style="color:blue;">:</mark> ps1rgbvhl

* This is the network identifier for your machine, used to distinguish it in a network environment.

<mark style="color:blue;">**Operating System**</mark><mark style="color:blue;">:</mark> GNU/Linux

* This indicates that you're using a GNU/Linux distribution, a combination of the Linux kernel with GNU software.

<mark style="color:blue;">**Detailed Kernel Version**</mark>

* This reiterates your kernel version and build details. It also mentions the GCC version used for building the kernel (9.4.0), which affects compatibility with certain software.

<mark style="color:blue;">**CPU Information**</mark><mark style="color:blue;">:</mark> Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz

* The system is powered by an Intel Xeon Gold 6342 processor, which is a high-performance, server-grade CPU. The 2.80 GHz frequency indicates its base clock speed.

<mark style="color:blue;">**Memory Information**</mark><mark style="color:blue;">:</mark> MemTotal: 92679772 kB

* The system has a substantial amount of RAM (92,679,772 kB, roughly 88 GiB). This is a significant size, suitable for memory-intensive applications and multitasking.

<mark style="color:blue;">**Ubuntu Distribution Information**</mark>

* **Distributor ID**: Ubuntu. This shows the Linux distribution you're using.
* **Description**: Ubuntu 20.04.6 LTS, indicating the specific version and that it's a Long-Term Support (LTS) release.
* **Release**: 20.04, the version number.
* **Codename**: focal, the internal codename for this Ubuntu release.

<mark style="color:blue;">**NVCC Version**</mark>

* The output details the version of the NVIDIA CUDA Compiler (NVCC) as 12.1, built in February 2023. NVCC is a key component for compiling CUDA code, essential for developing applications that leverage NVIDIA GPUs for parallel processing tasks.

In summary, the output paints a picture of a powerful, 64-bit Linux system with a high-performance CPU and a significant amount of RAM, running an LTS version of Ubuntu.&#x20;

The presence of the NVCC with CUDA version 12.1 indicates readiness for CUDA-based development, particularly in fields like data science, machine learning, or any computationally intensive tasks that can benefit from GPU acceleration.

</details>

### <mark style="color:blue;">Installation of .NET SDK -  required for Polyglot Notebooks</mark>

<details>

<summary><mark style="color:green;">Installation of .NET</mark></summary>

.NET is a free, open-source, and cross-platform framework developed by Microsoft.&#x20;

It is used for building various types of applications, including web applications, desktop applications, cloud-based services, and more. .NET provides a rich set of libraries and tools for developers to create robust and scalable software solutions.

#### <mark style="color:green;">Add the Microsoft package repository</mark> <a href="#add-the-microsoft-package-repository" id="add-the-microsoft-package-repository"></a>

Installing with APT can be done with a few commands. Before you install .NET, run the following commands to add the Microsoft package signing key to your list of trusted keys and add the package repository.

Open a terminal and run the following commands:

```bash
wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
rm packages-microsoft-prod.deb
```

#### <mark style="color:green;">Install the SDK</mark> <a href="#install-the-sdk" id="install-the-sdk"></a>

The .NET SDK allows you to develop apps with .NET. If you install the .NET SDK, you don't need to install the corresponding runtime. To install the .NET SDK, run the following commands:

```bash
sudo apt-get update && \
  sudo apt-get install -y dotnet-sdk-8.0
```

#### <mark style="color:green;">Install the runtime</mark> <a href="#install-the-runtime" id="install-the-runtime"></a>

The ASP.NET Core Runtime allows you to run apps that were made with .NET that didn't provide the runtime. The following commands install the ASP.NET Core Runtime, which is the most compatible runtime for .NET. In your terminal, run the following commands:

```bash
sudo apt-get update && \
  sudo apt-get install -y aspnetcore-runtime-8.0
```

As an alternative to the ASP.NET Core Runtime, you can install the .NET Runtime, which doesn't include ASP.NET Core support: replace `aspnetcore-runtime-8.0` in the previous command with `dotnet-runtime-8.0`:

```bash
sudo apt-get install -y dotnet-runtime-8.0
```
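To confirm the installation succeeded, you can list what `dotnet` can see:

```bash
# Installed SDKs and runtimes
dotnet --list-sdks
dotnet --list-runtimes
```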

</details>

<details>

<summary><mark style="color:yellow;">If you want to change the GCC Version in your environment</mark></summary>

You can change your GCC version in a Conda environment.  &#x20;

Here's how you can change the GCC version in a Conda environment:

**Create a New Conda Environment (Optional)**

If you don't already have a specific environment for your CUDA work, create one:

```bash
conda create -n axolotl python=3.10
```

**Activate the Conda Environment**:

```bash
conda activate axolotl
```

**Install a Specific GCC Version**:

```bash
conda install gcc_linux-64=gcc_version
```

Replace `gcc_version` with the version of GCC you need, for example, `9.4.0`.

**Verify GCC Version**:

```bash
gcc --version
```

**Install CUDA Toolkit (if needed)**:

* If you haven't installed CUDA in your environment, you can do so using Conda (if available) or follow the CUDA Toolkit's installation guide:

```bash
conda install cudatoolkit=x.x
```

* Replace `x.x` with the version of the CUDA Toolkit you need.

</details>

<details>

<summary><mark style="color:yellow;">If you want to change the version of CUDA being used in your environment</mark></summary>

The Conda installation for CUDA is an efficient way to install and manage the CUDA Toolkit, especially when working with Python environments.&#x20;

<mark style="color:blue;">**Conda Overview**</mark>

* Conda can facilitate the installation of the CUDA Toolkit.

<mark style="color:blue;">**Installing CUDA Using Conda**</mark>

* Basic installation command: `conda install cuda -c nvidia`.
* This command installs all components of the CUDA Toolkit.

<mark style="color:blue;">**Uninstalling CUDA Using Conda**</mark>

* Uninstallation command: `conda remove cuda`.
* It removes the CUDA Toolkit installed via Conda.
* Special Tip: After uninstallation, check for any residual files or dependencies that might need manual removal.
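One way to check for such residual packages (a minimal sketch):

```bash
# List any remaining CUDA-related packages in the active environment
conda list | grep -i cuda
```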

<mark style="color:blue;">**Installing Previous CUDA Releases**</mark>

* Install specific versions using: `conda install cuda -c nvidia/label/cuda-<version>`.
* Replace `<version>` with the desired CUDA version (e.g., `11.3.0`).
* Special Tip: Installing previous versions can be crucial for compatibility with certain applications or libraries. Always check version compatibility with your project requirements.

<mark style="color:blue;">**Practical Example: Installing the CUDA Toolkit**</mark>

Create and activate a virtual environment, then:

* <mark style="color:yellow;">`conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc`</mark><mark style="color:yellow;">:</mark> Installs the NVIDIA CUDA Compiler (nvcc) from the specified NVIDIA channel on Conda. This is aligned with CUDA version 11.8.0, ensuring compatibility with the specific version of PyTorch being used.

**Additional Tools Installation (Optional)**:

* <mark style="color:yellow;">`conda install -c anaconda cmake`</mark><mark style="color:yellow;">:</mark> Installs CMake, a cross-platform tool for managing the build process of software using a compiler-independent method.
* <mark style="color:yellow;">`conda install -c conda-forge lit`</mark><mark style="color:yellow;">:</mark> Installs 'lit', a tool for executing LLVM's integrated test suites.

<mark style="color:blue;">**Installing PyTorch and Related Libraries**</mark>

* The <mark style="color:yellow;">`pip install`</mark> command is used to install specific versions of PyTorch (`torch`), along with its sister libraries `torchvision` and `torchaudio`. The `--index-url` flag points at the PyTorch wheel index for CUDA 11.8, ensuring that the installed PyTorch build is compatible with CUDA 11.8.
* Adding the Ubuntu toolchain PPA (Personal Package Archive) and installing GCC 11 and G++ 11 is also needed for building certain components that require C++ compilation, particularly `deepspeed`, a deep learning optimization library. A sketch of both steps follows this list.
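A minimal sketch of those commands (the CUDA 11.8 wheel index is PyTorch's public one; the toolchain PPA is `ubuntu-toolchain-r/test`):

```bash
# Install PyTorch built against CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Add the Ubuntu toolchain PPA and install GCC 11 / G++ 11 (needed to build deepspeed)
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-11 g++-11
```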

### <mark style="color:green;">**Checking to see whether the revised version of CUDA is installed**</mark>

<mark style="color:blue;">**CUDA in Conda Environments**</mark>

* When you create a Conda environment and install a specific version of CUDA (like 11.8 in your case), you are installing CUDA toolkit libraries that are <mark style="color:purple;">**compatible with that version within that environment**</mark>**.**
* This installation <mark style="color:purple;">**does not change the system-wide CUDA version**</mark>, nor does it affect what `nvidia-smi` displays.
* The Conda environment's CUDA version is used by the programs and processes running within that environment. <mark style="color:purple;">It's independent of the system-wide CUDA installation.</mark>

<mark style="color:blue;">**Verifying CUDA Version in Conda Environment**</mark>

* To check the CUDA version in your Conda environment, you <mark style="color:purple;">should not rely on</mark> <mark style="color:yellow;">`nvidia-smi`</mark>. Instead, you can check the version of the CUDA toolkit you have installed in your environment. This can typically be done by checking the version of specific CUDA toolkit packages installed in the environment, like `cudatoolkit`.
* You can use a command like <mark style="color:yellow;">`conda list cudatoolkit`</mark> within your Conda environment to see the installed version of the CUDA toolkit in that environment.
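For example (a sketch, assuming the environment is named `axolotl`, as earlier in this guide):

```bash
conda activate axolotl

# Show the CUDA toolkit package installed in this environment
conda list cudatoolkit

# If cuda-nvcc was installed from the NVIDIA channel, this shows the env's compiler version
nvcc --version
```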

<mark style="color:blue;">**Compatibility**</mark>

* It's important to ensure that the CUDA toolkit version within your Conda environment is compatible with the version supported by your NVIDIA driver (as indicated by `nvidia-smi`). <mark style="color:red;">If the toolkit version in your environment is</mark> <mark style="color:red;"></mark>*<mark style="color:red;">**higher**</mark>* <mark style="color:red;"></mark><mark style="color:red;">than the driver's supported version, you may encounter compatibility issues</mark>.

In summary, <mark style="color:yellow;">`nvidia-smi`</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">shows the maximum CUDA version supported by your GPU's driver,</mark> not the version used in your current Conda environment. To check the CUDA version in a Conda environment, use Conda-specific commands to list the installed packages and their versions.

Another way of putting it:

1. **CUDA Driver Version:** <mark style="color:yellow;">The version reported by</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`nvidia-smi`</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">is the CUDA driver version installed on your system</mark>, which is 12.3 in your case. This is the version of the driver software that allows your operating system to communicate with the NVIDIA GPU.
2. **CUDA Toolkit Version in PyTorch:** <mark style="color:yellow;">When you install PyTorch with a specific CUDA toolkit</mark> version (like `cu118` for CUDA 11.8), <mark style="color:yellow;">it refers to the version of the CUDA toolkit libraries that PyTorch uses for GPU acceleration</mark>. PyTorch packages these libraries with itself, so it does not rely on the system-wide CUDA toolkit installation.
3. **Compatibility:** The key point is compatibility. Your system's CUDA driver version (12.3) is newer and compatible with the CUDA toolkit version used by PyTorch (11.8). Generally, a newer driver version can support older toolkit versions without issues.
4. **Functionality Check:** <mark style="color:yellow;">As long as</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`torch.cuda.is_available()`</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">returns</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`True`</mark>, it indicates that <mark style="color:yellow;">PyTorch is able to interact with your GPU using its bundled CUDA libraries</mark>, and you should be able to run CUDA-accelerated PyTorch operations on your GPUs.

In summary, your setup is fine for running PyTorch with GPU support. The difference in the CUDA driver and toolkit versions is normal and typically not a problem as long as the driver version is equal to or newer than the toolkit version required by PyTorch.
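A quick functionality check from the shell (a sketch, assuming PyTorch is installed in the active environment):

```bash
# Prints: True <toolkit version> if PyTorch can reach the GPU
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```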

</details>

### <mark style="color:blue;">**Test Compatibility**</mark>

Below are some scripts you can create to test for compatibility.

These scripts will check that both your CPU and GPU are correctly processing CUDA code. They will also check that there are no compatibility issues between the installed GCC version and the CUDA Toolkit version you are using.

<details>

<summary><mark style="color:blue;">Compatibility Test Scripts</mark></summary>

To test the compatibility of your GCC version with the CUDA Toolkit version installed, you can use a simple CUDA program. Below is a basic script for a CUDA program that performs a simple operation on the GPU. This script will help you verify that your setup is correctly configured for CUDA development.

First, create a simple CUDA program. Let's call it <mark style="color:yellow;">`test_cuda.cu`</mark>:

```cpp
#include <stdio.h>
#include <math.h>        // fmax, fabs used in the error check
#include <cuda_runtime.h>

// Kernel function to add two vectors
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void) {
    int N = 1<<25; // 33.6M elements

    float *x, *y;
    cudaEvent_t start, stop;

    // Allocate unified memory, accessible from both host (CPU) and device (GPU)
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // Initialise the input vectors on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf("Time taken: %f ms\n", milliseconds);

    // Verify the result: every element of y should now equal 3.0f
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    printf("Max error: %f\n", maxError);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    
    return 0;
}

```

Next, create a shell script to compile and run this CUDA program. Name this script `test_cuda_compatibility.sh`:

```bash
#!/bin/bash

# Define the CUDA file
cuda_file="test_cuda.cu"

# Define the output executable
output_executable="test_cuda_executable"

# Compile the CUDA program
nvcc $cuda_file -o $output_executable

# Check if the compilation was successful
if [ $? -eq 0 ]; then
    echo "Compilation successful. Running the CUDA program..."
    ./$output_executable
else
    echo "Compilation failed."
fi
```

This script compiles the <mark style="color:yellow;">`test_cuda.cu`</mark> file using <mark style="color:yellow;">`nvcc`</mark>, the NVIDIA CUDA compiler, and then runs the compiled executable if the compilation is successful.

**How to Use the Script:**

1. Save the CUDA program code in a file named <mark style="color:yellow;">`test_cuda.cu`</mark><mark style="color:yellow;">.</mark>
2. Save the shell script in a file named <mark style="color:yellow;">`test_cuda_compatibility.sh`</mark><mark style="color:yellow;">.</mark>
3. Make the shell script executable:

```bash
chmod +x test_cuda_compatibility.sh
```

Run the script:

```bash
./test_cuda_compatibility.sh
```

If everything is set up correctly, the script will compile the CUDA program and run it, printing the kernel execution time and a maximum error of 0.&#x20;

If there are compatibility issues between GCC and the CUDA Toolkit, the script will likely fail during compilation, and you'll see error messages indicating what went wrong.

</details>

<mark style="color:red;">**Remember:**</mark> Compatibility between the GCC version and the CUDA Toolkit is crucial. Make sure the GCC version you choose is compatible with your CUDA Toolkit version.

### <mark style="color:blue;">Where are you now?</mark>

We have now created a deep learning development environment optimised for NVIDIA GPUs, with compatibility across key components.

We have so far:

**-Installed CUDA Toolkit and Drivers**

**-Set up the NVIDIA Container Toolkit to allow access to NVIDIA Docker containers**

**-Ensured Host Compatibility by verifying** that components such as GCC (GNU Compiler Collection) and GLIBC (GNU C Library) are compatible with the CUDA version.

**-Created a Compatibility Check Script** to check for compatibility issues between GCC and the CUDA Toolkit

With these components in place, your environment is tailored for deep learning development. It supports the development and execution of deep learning models, leveraging the computational power of GPUs for training and inference tasks.

### <mark style="color:blue;">==>  With the environment established for NVIDIA GPUs, the next step is creating the virtual environment for Axolotl and installing the code base</mark>

[^1]: This is the version of CUDA

[^2]: This is the version number

[^3]: This is the CUDA Version

[^4]: This is the virtual machine

