# Phi 2.0 - Preprocessing

Before executing the training run, we first preprocess the dataset.

Axolotl allows us to optionally **pre-tokenize the dataset** before fine-tuning. This is recommended for large datasets.

### Populate the config.yaml file

To execute the preprocessing, the **datasets component** of the configuration file must be correctly populated.

You will need to **direct Axolotl to where your dataset is**. The configuration file is located at:

```bash
your directory/axolotl/examples/phi/phi2-ft.yml
```

When you have located the file, **populate the datasets component** as shown below.

To find the **path to your dataset** in VS Code, **right click** on the dataset and copy its relative path. Use this as the `path` value below. Remember to **include the file name.**

```yaml
datasets:
  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json
dataset_prepared_path:
val_set_size: 0.20
output_dir: ./phi-out
```

The dataset type (`ds_type`) is JSON, and we have arbitrarily set the output directory to **./phi-out**.

For a refresher of what the alpaca-cleaned dataset contains:

Alpaca Cleaned Dataset - Description

The Alpaca Cleaned dataset is a curated and cleaned version of the original Alpaca dataset released by Stanford, specifically designed for instruction-tuning of language models.

**Dataset Description and Improvements**

1. **Issues Addressed**:
   * **Hallucinations**: The original dataset included instructions that led to irrelevant or fabricated responses by the model, particularly in cases where the input was a URL or an ambiguous prompt. These have been cleaned to prevent such issues.
   * **Merged Instructions**: Instances where multiple instructions were combined have been separated for clarity and precision.
   * **Empty Outputs**: Entries with no outputs have been addressed.
   * **Missing Code Examples**: The cleaned dataset ensures that code examples are included where necessary.
   * **Instructions for Image Generation**: Instructions that required image generation have been removed or modified, as this is not feasible for a text-based model.
   * **N/A Outputs and Inconsistent Input Fields**: These issues have been standardised or corrected.
   * **Incorrect Answers**: The dataset has been reviewed for accuracy, especially in areas like math problems where the original dataset had a high error rate.
   * **Unclear Instructions**: Ambiguous instructions have been clarified or rewritten.
   * **Extraneous Characters**: Unnecessary escape and control characters have been removed.
2. **Original Alpaca Dataset Summary**:
   * The original dataset, Alpaca, contains 52,000 instruction examples created using OpenAI's `text-davinci-003` engine.
   * It was designed for instruction tuning, making language models better at following specific instructions.
   * The dataset was generated more cost-effectively and is more diverse compared to its predecessors.
3. **Dataset Structure**:
   * The dataset consists of the fields `instruction`, `input` and `output`, plus a `text` field that combines these elements in a format suitable for model fine-tuning.
   * The `input` field is optional and used when context is necessary for the instruction.
4. **Data Splits**:
   * The dataset contains 52,002 training examples.
5. **Usage and Applications**:
   * It is intended for instruction training of pre-trained language models, particularly in the text-generation domain.
6. **Languages**:
   * The dataset is exclusively in English.
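The record structure described above can be checked before preprocessing. The following is a minimal sketch, not part of Axolotl: the helper name `find_problems` and the two sample records are hypothetical, and it assumes that `instruction` and `output` must be non-empty strings while `input` may be empty or absent.

```python
import json

# Fields every record is assumed to need; `input` is treated as optional.
REQUIRED_FIELDS = ("instruction", "output")


def find_problems(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str) or not record[field].strip():
            problems.append(f"empty or non-string field: {field}")
    # `input` may be missing, but if present it should be a string.
    if "input" in record and not isinstance(record["input"], str):
        problems.append("non-string field: input")
    return problems


# Hypothetical sample records, for illustration only.
sample = json.loads("""
[
  {"instruction": "Name three primary colours.", "input": "", "output": "Red, blue and yellow."},
  {"instruction": "Summarise the text.", "input": "Some context."}
]
""")

for index, record in enumerate(sample):
    print(index, find_problems(record) or "ok")
```

To check the real file, replace the inline sample with `json.load` on the path you entered in the `datasets` component.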

Once you have configured the dataset component of the YAML file, execute the following command, which runs the **preprocess.py script** within the Axolotl library.

{% code fullWidth="false" %}
```bash
python -m axolotl.cli.preprocess examples/phi/phi2-ft.yml
```
{% endcode %}

The output should indicate success:

```bash
Success! Preprocessed data path: `dataset_prepared_path: ./phi-out'
```

If you have not set a dataset\_prepared\_path in the configuration file, the script falls back to a default path and reports:

"preprocess CLI called without dataset\_prepared\_path set, using default path: last\_run\_prepared"

For an analysis of the output relating to the dataset preprocessing, see below:

Analysis of Output from Pre-Processing

The numbers:

1. **PID (Process ID)**: `6779`
   * This is the ID of the process running the script.
2. **Token IDs**:
   * End Of Sentence (EOS) Token ID: `2`
   * Beginning Of Sentence (BOS) Token ID: `1`
   * Padding (PAD) Token ID: `0`
   * Unknown (UNK) Token ID: `0`
3. **Data Downloading and Processing**:
   * Number of data files downloaded: `1`
   * Download speed: `12557.80 items/second`
   * Number of data files extracted: `1`
   * Extraction speed: `214.70 items/second`
4. **Dataset Generation**:
   * Number of examples in the train split: `51760`
   * Generation speed: `85351.57 examples/second`
5. **Mapping and Filtering**:
   * Number of examples processed in mapping (num\_proc=12): `51760`
   * Processing speed in mapping: `6035.04 examples/second`
   * Number of examples processed in filtering (num\_proc=12): `51760`
   * Filtering speed: `41994.96 examples/second`
   * Number of examples processed in the second mapping (num\_proc=12): `51760`
   * Second mapping speed: `30435.07 examples/second`
6. **Token Counts**:
   * Total number of tokens: `12104896`
   * Total number of supervised tokens: `8475133`
7. **Efficiency Estimates and Data Loading**:
   * Packing efficiency estimate: `1.0`
   * Total number of tokens per device: `12104896`
   * Data loader length: `1461`
   * Sample packing efficiency estimate across ranks: `0.9798729691644562`
   * Sample packing efficiency estimate: `0.98`
   * Total number of steps: `5844`
8. **Time and Date Information**:
   * Date of the log entries: `2023-12-06`
   * Times of various log entries: `05:24:22,428`, `05:24:22,688`, `05:24:32,282`, `05:24:36,028`, `05:24:36,393`, `05:24:41,028`, `05:24:41,029`, `05:24:41,037`
   * Total execution time: `26 seconds`
   * Time when the command prompt was ready again: `05:24:42`
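Two of the figures above can be related with simple arithmetic. The sketch below uses only numbers from the log and the `num_epochs: 4` setting from the configuration file on this page. Treating the total number of steps as data loader length multiplied by epochs is an inference from these numbers, not something the log states.

```python
# Values copied from the preprocessing log above.
total_tokens = 12_104_896
supervised_tokens = 8_475_133
data_loader_length = 1461
reported_total_steps = 5844

# Value copied from the configuration file (num_epochs: 4).
num_epochs = 4

# Share of tokens that contribute to the loss (train_on_inputs is false,
# so prompt tokens are presumably the unsupervised remainder).
supervised_fraction = supervised_tokens / total_tokens
print(f"supervised fraction: {supervised_fraction:.3f}")  # 0.700

# Assumed relationship: one pass over the data loader per epoch.
estimated_total_steps = data_loader_length * num_epochs
print(f"estimated total steps: {estimated_total_steps}")  # 5844
print(f"matches log: {estimated_total_steps == reported_total_steps}")
```

Roughly 70% of the tokens are supervised, and the step count in the log is consistent with four passes over a data loader of length 1461.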

If you would like information on how the preprocess.py script works, you can view the code at `axolotl/src/axolotl/cli`.

Alternatively, take the time to read the analysis of the script and the output it produced:

Axolotl preprocess.py - analysis of script and command output

The provided **`preprocess.py`** script is a **command-line interface (CLI) tool** for preprocessing datasets before training models with Axolotl.

**Imports and Logger Setup**

* The script imports the necessary modules: **`logging`**, **`pathlib.Path`** for file path operations, **`fire`** for the CLI, and several modules from **`transformers`** and **`axolotl`**.
* **`colorama`** is used for coloured console output, enhancing readability.
* A logger **`LOG`** is set up using the **`logging`** module.

**do\_cli Function**

The main function **`do_cli`** takes a `config` argument (with a default path of **`"examples/"`**) and **`**kwargs`** for additional arguments. It proceeds through the following steps:

* **ASCII Art**: It starts by printing ASCII art using **`print_axolotl_text_art`**. This is purely aesthetic and has no impact on the functional aspects of the script.
* **Configuration Loading**: It loads the configuration file using **`load_cfg`**. The configuration includes essential details about the dataset, model, and training setup, and so determines how the dataset will be processed and how the model training will proceed.
* **Accelerator and User Token Checks**: It ensures that the default accelerator configuration is set and validates the user token for authentication or API access.
* **CLI Arguments Parsing**: The script uses **`transformers.HfArgumentParser`** to parse additional command-line arguments specific to preprocessing, defined in `PreprocessCliArgs`.
* **Dataset Preparation Path Check**: It checks if **`dataset_prepared_path`** is set in the configuration. If it is not explicitly set, the script issues a warning and assigns a default path (**`DEFAULT_DATASET_PREPARED_PATH`**). This path is where the script will store the preprocessed dataset, ensuring that there is always a defined location for this data.
* **Load and Preprocess Datasets**: The **`load_datasets`** function is called with the loaded configuration and parsed CLI arguments. This function does the heavy lifting of loading the dataset (which could include downloading it if it is not already present locally) and preprocessing it according to the specifications in the configuration. This step is central to preparing the data for model training.
* **Logging Success**: Upon successful completion, the script logs a message indicating the path where the preprocessed data is stored. This is done using the **`LOG.info`** method, and the message is coloured green for visibility.

Two further things appear in the output while these steps run:

* Initially, there is a deprecation warning from the **`transformers.deepspeed`** module. This suggests that in future versions of the Transformers library, you will need to import DeepSpeed modules differently.
* The script provides debug information about special tokens (End Of Sentence, Beginning Of Sentence, Padding, and Unknown) used by the tokenizer. This information is crucial for understanding how the model will interpret different types of tokens in the data.

**Main Block**

This block checks if the script is the main program being run (**`__name__ == "__main__"`**) and not a module imported in another script. If it is the main program, it uses **`fire.Fire(do_cli)`** to enable the script to be run from the command line, where **`do_cli`** is the main function being executed.

**General Observations**

* **User-Friendly**: The script is user-friendly, providing clear messages and using colour coding for warnings and success messages.
* **Flexibility**: It is flexible and allows customization through command-line arguments.
* **Error Handling**: The script checks for potential issues like missing configuration parameters and handles them gracefully, providing default values and warnings.
* **Logging**: Effective use of logging helps in tracking the script's execution and diagnosing issues if any arise.

**OUTPUT**

1. **Warnings and ASCII Art**: The output opens with the deprecation warning from `transformers.deepspeed` and the Axolotl ASCII art.
2. **Token Information**: The script provides debug information about special tokens (End Of Sentence, Beginning Of Sentence, Padding, and Unknown) used by the tokenizer.
3. **Data File Processing**: The script downloads and extracts data files, then generates a training split with 51,760 examples. This is followed by mapping and filtering operations on the dataset, performed in parallel (noted by `num_proc=12`), which is an efficient way to handle large datasets.
4. **Dataset Merging and Saving**: After processing, the datasets are merged and saved to disk. This is an essential step to ensure that the processed data is stored in a format ready for model training.
5. **Token Counts and Sample Packing**: The script calculates the total number of tokens and supervised tokens. It also estimates the packing efficiency and total number of steps needed for training, which are critical for understanding how the data will be batched and fed into the model during training.
6. **Final Information and Path**: Finally, the script concludes with a success message and provides the path to the preprocessed data. This is your confirmation that the preprocessing was successful and where the prepared data is stored.
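The dataset preparation path check described above can be illustrated in isolation. This is a stand-alone sketch, not Axolotl's actual code: the function name `resolve_prepared_path` and the use of a plain dictionary for the configuration are assumptions made for illustration, while the constant name and the warning text are taken from this page.

```python
import logging

LOG = logging.getLogger("preprocess_sketch")

# Default reported by the CLI when no path is configured.
DEFAULT_DATASET_PREPARED_PATH = "last_run_prepared"


def resolve_prepared_path(cfg: dict) -> str:
    """Return the prepared-dataset path, falling back to the default.

    Hypothetical helper: an empty or missing value triggers a warning
    and the default path is written back into the configuration.
    """
    if not cfg.get("dataset_prepared_path"):
        LOG.warning(
            "preprocess CLI called without dataset_prepared_path set, "
            "using default path: %s",
            DEFAULT_DATASET_PREPARED_PATH,
        )
        cfg["dataset_prepared_path"] = DEFAULT_DATASET_PREPARED_PATH
    return cfg["dataset_prepared_path"]


# An empty `dataset_prepared_path:` entry in YAML loads as None.
print(resolve_prepared_path({"dataset_prepared_path": None}))
print(resolve_prepared_path({"dataset_prepared_path": "./phi-out"}))
```

Setting `dataset_prepared_path` explicitly in the YAML file avoids the warning and makes it clear where the tokenized data will be written.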

The complete configuration file used for this run is:

The configured Phi2-ft.yml YAML file

```yaml
base_model: microsoft/phi-2
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json

dataset_prepared_path:
val_set_size: 0.20
output_dir: ./phi-out

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_torch
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0
lr_scheduler: cosine
learning_rate: 0.000003

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: True
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
resize_token_embeddings_to_32x: true
special_tokens:
  pad_token: "<|endoftext|>"
```

{% hint style="warning" %}
Potential Flash Attention Dependency Issues
{% endhint %}

### Flash Attention Issues

Some users have had issues with Flash Attention dependencies.

For some reason the axolotl environment keeps reverting to PyTorch 2.0.1, whereas Flash Attention needs PyTorch 2.1.0.

We executed this command in the axolotl library, and it upgraded PyTorch to 2.1.0.

```bash
pip install flash_attn -U --force-reinstall
```

Flash Attention Debugging

Flash Attention needs PyTorch 2.1.0, not 2.0. I had to force-reinstall Flash Attention to make it work.

The error message originates from running a Python script that attempts to preprocess a configuration file for a machine learning model using the `axolotl.cli.preprocess` module. The error is an `ImportError` associated with the `flash_attn_2_cuda` library, which is part of the `flash_attn` package.

Let's break down the error message line by line:

1. **Command Executed**: `python -m axolotl.cli.preprocess your directory/axolotl/examples/llama-2/lora.yml`
   * This command attempts to run a Python module (`axolotl.cli.preprocess`) with a specific configuration file as its argument.
2. **Initial Traceback**:
   * The Python interpreter starts by trying to execute the module but encounters an issue in the process.
3. **Import Chain**:
   * The error arises in a chain of imports starting from the script's initial import statement. Python attempts to import the various modules and packages needed for the script to run.
4. **Specific ImportError**:
   * The final error, `ImportError`, occurs when Python tries to import `flash_attn_2_cuda` from the `flash_attn` package.
   * The error message specifically states `undefined symbol: _ZN3c104cuda9SetDeviceEi`. This indicates a missing or incompatible symbol in the compiled CUDA extension, which typically means the extension was built against a different PyTorch version from the one installed.

#### Troubleshooting Steps

1. **Verify CUDA Installation**:
   * Ensure that CUDA is correctly installed on your system and is compatible with the versions required by `flash_attn`.
   * Check the CUDA version against the version required by `flash_attn`. If there's a mismatch, consider upgrading or downgrading CUDA.
2. **Check PyTorch and flash\_attn Compatibility**:
   * Verify that the installed PyTorch version is compatible with `flash_attn`. Incompatibilities between CUDA, PyTorch, and `flash_attn` can lead to such errors.
3. **Reinstall flash\_attn**:
   * Try reinstalling the `flash_attn` package. Sometimes, recompiling the package can resolve symbol mismatch errors.
   * Use the command `pip install flash-attn --force-reinstall` to reinstall.
4. **Check Environment Variables**:
   * Ensure that your `LD_LIBRARY_PATH` environment variable includes the paths to the CUDA libraries.
5. **Inspect Python Environment**:
   * Verify that you are using the correct Python environment where all required dependencies are installed. Sometimes, conflicts in environments can cause such issues.
6. **Check for System Updates**:
   * Occasionally, system updates or changes to the compiler or related libraries can cause incompatibilities. Ensure your system is up to date.
7. **Consult Documentation or Community Forums**:
   * Check the documentation for `flash_attn` and related packages for any known issues or troubleshooting tips.
   * Seek help from relevant community forums or issue trackers for `flash_attn` and related projects.
8. **Test in a Clean Environment**:
   * If possible, try to run the script in a clean Python environment with only necessary packages installed. This can help rule out conflicts with other packages.
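The compatibility and environment checks above can be started from Python itself. The following is a minimal sketch using only the standard library: the helper names `installed_version` and `try_import_flash_attn` are hypothetical, and it assumes the packages are published under the distribution names `torch` and `flash-attn`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def try_import_flash_attn() -> str:
    """Attempt to import flash_attn and report the outcome as a string."""
    try:
        import flash_attn  # noqa: F401
    except (ImportError, OSError) as exc:
        # An undefined-symbol failure in the CUDA extension surfaces here.
        return f"import failed: {exc}"
    return "import ok"


# Report which versions are present in the active environment.
for name in ("torch", "flash-attn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")

print(try_import_flash_attn())
```

Running this inside the same environment as the preprocessing command shows whether PyTorch has reverted to an older version and whether the `flash_attn` import itself is what fails.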

Using the Upload File Function

The `upload_file` function from the Hugging Face Hub library is used to upload files to a repository on the Hugging Face Hub.

This function is quite versatile, supporting various parameters to customize the upload process. Here's a breakdown of the parameters and what they mean:

1. **path\_or\_fileobj (str, Path, bytes, or IO)**: This is the source of the file you want to upload. It can be a path to a file on your local machine, a binary data stream, a file object, or a buffer.
2. **path\_in\_repo (str)**: This specifies where in the repository the file should be placed. For example, if you want to place a file in a folder called 'checkpoints', you would use something like `"checkpoints/weights.bin"`.
3. **repo\_id (str)**: The identifier for the repository to which you are uploading. This usually follows the format `"username/repository-name"`.
4. **token (str, optional)**: Your authentication token for the Hugging Face Hub. If not provided, the function will attempt to use the token stored by a previous login (for example with `huggingface-cli login`).
5. **repo\_type (str, optional)**: Indicates the type of repository. It can be `"dataset"`, `"space"`, or `"model"`. If you are uploading to a dataset or a space, specify accordingly. The default is `None`, which is interpreted as a model.
6. **revision (str, optional)**: Specifies the git revision (like a branch name or commit hash) from which the commit should be made. The default is the head of the "main" branch.
7. **commit\_message (str, optional)**: A short summary or title for the commit you are making. Think of this as the headline of your changes.
8. **commit\_description (str, optional)**: A more detailed description of the changes you are committing.
9. **create\_pr (boolean, optional)**: Determines whether to create a Pull Request for the commit. Defaults to `False`. If `True`, a PR will be created based on the specified `revision`.
10. **parent\_commit (str, optional)**: The hash of the parent commit to which your changes will be added. This is used to ensure that you are committing to the correct version of the repository, which is especially useful in concurrent environments.
11. **run\_as\_future (bool, optional)**: If set to `True`, the upload process will run in the background as a non-blocking action. This returns a `Future` object that can be used to check the status of the upload.

The `path_in_repo` parameter specifies the destination path within the repository on the Hugging Face Hub where the file will be uploaded. This is the relative path in the repository where the file will be placed.

In this example command:

```python
from huggingface_hub import HfApi

# Create the API client; it uses the locally stored token unless one is passed.
api = HfApi()

api.upload_file(
    path_or_fileobj="/path/to/trained_model_weights.bin",  # local file to upload
    path_in_repo="model_weights.bin",  # destination path inside the repository
    repo_id="username/my_ml_model",
    repo_type="model",
)
```

* `path_or_fileobj="/path/to/trained_model_weights.bin"`: This is the path to the file on your local machine. It indicates where the file is currently stored in your local file system.
* `path_in_repo="model_weights.bin"`: This determines where within your Hugging Face Hub repository the file will be saved. In this case, you are instructing the function to upload your file directly to the root of the repository and name it `model_weights.bin`. If you wanted to place this file inside a folder within your repository, you would specify a path like `"folder_name/model_weights.bin"`.
* `repo_id="username/my_ml_model"`: This identifies the specific repository on the Hugging Face Hub where the file should be uploaded. It's a combination of your username (or organization name) and the repository name.
* `repo_type="model"`: This indicates the type of repository you are uploading to. In this case, it's a model repository.
