# Full Fine Tune

This is the configuration file for a full fine-tune of Llama 3 8B.

```yaml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files: 
      - datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet

dataset_prepared_path: ./prepared_data
val_set_size: 0.10
output_dir: ./llama-fft-out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project: llama3-fft
wandb_entity: continuum-labs
wandb_watch:
wandb_name: 
wandb_log_model:

gradient_accumulation_steps: 8  
micro_batch_size: 1  
num_epochs: 1  
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
```

### <mark style="color:blue;">Suggestions</mark>

#### <mark style="color:green;">Optimizer and Learning Rate</mark>

* You're currently using the <mark style="color:yellow;">**`paged_adamw_8bit`**</mark> optimizer with a learning rate of 2e-5. You could experiment with other optimizers like <mark style="color:yellow;">**`lion_8bit`**</mark>, <mark style="color:yellow;">**`galore_adamw_8bit`**</mark>, or <mark style="color:yellow;">**`adamw_torch_fused`**</mark> to see whether they give a lower evaluation loss or fit your memory budget better.
* Additionally, you can try different learning rates, such as 1e-5 or 3e-5, to find the optimal value for your specific dataset and model.
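As a rough sketch, swapping the optimizer or learning rate is a two-line change to the config above. The values are simply the alternatives named in this section, not tuned recommendations.

```yaml
# Change one setting at a time so the effect of each change stays visible.
optimizer: adamw_torch_fused   # alternatives: lion_8bit, galore_adamw_8bit
learning_rate: 1e-5            # also worth trying: 3e-5
```

Keep in mind that a non-8-bit optimizer such as `adamw_torch_fused` stores larger optimizer state than `paged_adamw_8bit`, so memory use will rise.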

#### <mark style="color:green;">Learning Rate Scheduler</mark>

* You're using the <mark style="color:yellow;">**`cosine`**</mark> learning rate scheduler. You could explore other options like <mark style="color:yellow;">**`one_cycle`**</mark> or <mark style="color:yellow;">**`log_sweep`**</mark> to see if they improve the training process.
* If using the cosine scheduler, you can set <mark style="color:yellow;">**`cosine_min_lr_ratio`**</mark> and <mark style="color:yellow;">**`cosine_constant_lr_ratio`**</mark> to control the decay and freezing of the learning rate during training.
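A minimal sketch of the two cosine options follows. The ratios are illustrative assumptions, not values taken from the config above.

```yaml
lr_scheduler: cosine
cosine_min_lr_ratio: 0.1        # decay to 10% of the peak learning rate instead of to zero
cosine_constant_lr_ratio: 0.8   # reach that floor at 80% of training, then hold it constant
```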

#### <mark style="color:green;">Gradient Accumulation and Batch Size</mark>

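* Adjust the <mark style="color:yellow;">**`gradient_accumulation_steps`**</mark> and <mark style="color:yellow;">**`micro_batch_size`**</mark> based on your available GPU memory. A larger <mark style="color:yellow;">**`micro_batch_size`**</mark> needs more memory per step, while more <mark style="color:yellow;">**`gradient_accumulation_steps`**</mark> raise the effective batch size without raising peak memory, at the cost of more forward and backward passes per optimizer step.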
* You can also set <mark style="color:yellow;">**`eval_batch_size`**</mark> to a different value than <mark style="color:yellow;">**`micro_batch_size`**</mark> for evaluation.
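The three batch settings could sit together as in the sketch below. The `eval_batch_size` value is an assumption chosen for illustration.

```yaml
micro_batch_size: 1             # sequences per GPU in each forward/backward pass
gradient_accumulation_steps: 8  # passes accumulated before each optimizer step
eval_batch_size: 2              # illustrative; evaluation computes no gradients, so a larger batch may fit
```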

#### <mark style="color:green;">Datasets and Preprocessing</mark>

* Consider using multiple datasets by adding more entries to the <mark style="color:yellow;">**`datasets`**</mark> list. This can help improve the model's generalization and robustness.
* Experiment with different <mark style="color:yellow;">**`sequence_len`**</mark> values to find the optimal sequence length for your specific task and dataset.
* Set <mark style="color:yellow;">**`train_on_inputs`**</mark> to <mark style="color:yellow;">**`true`**</mark> if you want to include the human's prompt in the training labels.
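A sketch of a two-dataset configuration is shown here. The second entry is a hypothetical placeholder, so replace its path with a dataset you actually have.

```yaml
datasets:
  # Existing dataset from the config above.
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files:
      - datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
  # Hypothetical second dataset in the same alpaca format.
  - path: your-org/your-second-dataset
    type: alpaca

train_on_inputs: false   # set to true to also compute the loss on prompt tokens
```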

#### <mark style="color:green;">Evaluation and Checkpointing</mark>

* Adjust <mark style="color:yellow;">**`eval_steps`**</mark> or <mark style="color:yellow;">**`evals_per_epoch`**</mark> to control the frequency of evaluation during training. Set only one of the two. More frequent evaluation gives a finer-grained view of the model's progress but adds to the training time.
* Similarly, modify <mark style="color:yellow;">**`save_steps`**</mark> or <mark style="color:yellow;">**`saves_per_epoch`**</mark> (again, only one of the two) to control the frequency of checkpoint saving.
* Set <mark style="color:yellow;">**`save_total_limit`**</mark> to limit the number of checkpoints saved at a time, preventing disk space issues.
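A step-based variant might look like the sketch below. The step counts are illustrative assumptions, and `evals_per_epoch` and `saves_per_epoch` would need to be removed from the config when these are used.

```yaml
# Step-based alternative to evals_per_epoch / saves_per_epoch.
eval_steps: 50        # illustrative: evaluate every 50 steps
save_steps: 100       # illustrative: write a checkpoint every 100 steps
save_total_limit: 3   # keep at most 3 checkpoints on disk
```

A limit like this matters more in a full fine-tune than with an adapter, because each checkpoint contains the full set of model weights.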

#### <mark style="color:green;">Advanced Techniques</mark>

* Set <mark style="color:yellow;">**`neftune_noise_alpha`**</mark> to add noise to the embeddings during training (the NEFTune technique), which can improve generalization.
* Set <mark style="color:yellow;">**`s2_attention`**</mark> to <mark style="color:yellow;">**`true`**</mark> to use shifted-sparse attention for LLaMA models, which can reduce memory usage and improve efficiency.
* Experiment with different values for <mark style="color:yellow;">**`lora_r`**</mark>, <mark style="color:yellow;">**`lora_alpha`**</mark>, and <mark style="color:yellow;">**`lora_dropout`**</mark> if you switch to LoRA adaptation. These settings have no effect in a full fine-tune like the one above, which trains no adapter.
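Enabling NEFTune is a single extra line, sketched here. The value is an illustrative starting point rather than a recommendation for this dataset.

```yaml
neftune_noise_alpha: 5   # illustrative value; leave the key unset to train without embedding noise
```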

#### <mark style="color:green;">Debugging and Monitoring</mark>

* Set <mark style="color:yellow;">**`debug`**</mark> to <mark style="color:yellow;">**`true`**</mark> to enable debug mode for more detailed logging and debugging information.
* Adjust <mark style="color:yellow;">**`logging_steps`**</mark> to control the frequency of logging during training.
* Set <mark style="color:yellow;">**`loss_watchdog_threshold`**</mark> and <mark style="color:yellow;">**`loss_watchdog_patience`**</mark> to monitor and abort training if the loss exceeds a certain threshold for a specified number of steps.
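A sketch of the loss watchdog settings follows. Both numbers are assumptions; a sensible threshold depends on the loss you see at the start of your own run.

```yaml
loss_watchdog_threshold: 5.0  # illustrative: treat a loss above this value as a sign of divergence
loss_watchdog_patience: 3     # abort after this many consecutive steps above the threshold
```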

#### <mark style="color:green;">Tokenizer and Special Tokens</mark>

* Add or modify special tokens in the <mark style="color:yellow;">**`special_tokens`**</mark> section based on your specific requirements.
* If you have added extra tokens to your tokenizer, specify them in the <mark style="color:yellow;">**`tokens`**</mark> list.
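The two sections could be combined as in this sketch. The token strings under `tokens` are hypothetical and only illustrate the list format.

```yaml
special_tokens:
  pad_token: <|end_of_text|>

# Hypothetical extra tokens; replace with the tokens your data actually uses.
tokens:
  - "<|tool_call|>"
  - "<|tool_result|>"
```

Newly added tokens start with untrained embeddings, so they only become useful if they appear in the training data.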

Remember to experiment and iterate on these configurations to find the optimal settings for your specific use case. It's also important to monitor the training progress, evaluate the model's performance, and make adjustments accordingly.

### <mark style="color:blue;">Machine Memory</mark>

#### <mark style="color:green;">Memory Usage</mark>

* With <mark style="color:yellow;">**`load_in_8bit`**</mark> and <mark style="color:yellow;">**`load_in_4bit`**</mark> set to <mark style="color:yellow;">**`false`**</mark>, the model weights are loaded <mark style="color:blue;">**without quantization**</mark>. Because <mark style="color:yellow;">**`bf16`**</mark> is set to <mark style="color:yellow;">**`auto`**</mark>, training should run in bfloat16 on GPUs that support it, such as the A100. At 2 bytes per parameter, the weights of an 8B-parameter model alone take roughly 16 GB, before gradients, optimizer state, and activations are counted.
* Setting <mark style="color:yellow;">**`load_in_8bit`**</mark> to <mark style="color:yellow;">**`true`**</mark> reduces memory consumption, but quantized weights cannot be trained directly, so it is intended for adapter-based training such as LoRA rather than a full fine-tune. To reduce memory here, adjust the sequence length, batch size, or optimizer instead.

#### <mark style="color:green;">Sequence Length</mark>

* You have set <mark style="color:yellow;">**`sequence_len`**</mark> to 8192, which is quite large. Depending on your specific task and dataset, you may not need such a long sequence length.
* Increasing the sequence length will require more memory during training. If you encounter out-of-memory (OOM) issues, consider reducing the <mark style="color:yellow;">**`sequence_len`**</mark> to a smaller value, such as 2048 or 4096.
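If memory becomes the bottleneck, a reduced-length variant of the relevant settings might look like this sketch, which uses one of the smaller values suggested above.

```yaml
sequence_len: 4096        # halved from 8192; try 2048 if memory is still tight
sample_packing: true      # unchanged: packs several short examples into each sequence
pad_to_sequence_len: true # unchanged
```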

#### <mark style="color:green;">Gradient Accumulation and Batch Size</mark>

* With <mark style="color:yellow;">**`gradient_accumulation_steps`**</mark> set to 8 and <mark style="color:yellow;">**`micro_batch_size`**</mark> set to 1, the effective batch size will be 1 × 8 = 8 sequences per GPU for each optimizer step.
* Depending on your available memory, you may be able to increase the <mark style="color:yellow;">**`micro_batch_size`**</mark> to a larger value, such as 2 or 4, while keeping the <mark style="color:yellow;">**`gradient_accumulation_steps`**</mark> the same. This can help speed up the training process.
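The arithmetic behind the effective batch size can be sketched directly in the config. The values below are one possible adjustment, not a recommendation.

```yaml
micro_batch_size: 2             # raised from 1, memory permitting
gradient_accumulation_steps: 8  # unchanged
# Sequences per optimizer step on one GPU: 2 * 8 = 16 (previously 1 * 8 = 8).
# To keep the effective batch size at 8 instead, pair micro_batch_size: 2
# with gradient_accumulation_steps: 4.
```

Changing the effective batch size changes the training dynamics, so the learning rate may need revisiting as well.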

#### <mark style="color:green;">Optimizer and Learning Rate</mark>

* You're using the <mark style="color:yellow;">**`paged_adamw_8bit`**</mark> optimizer, which is a good choice for memory efficiency. It depends on the `bitsandbytes` library, so make sure that is installed.
* The learning rate of 2e-5 is a reasonable starting point, but you may need to experiment with different values to find the optimal learning rate for your specific dataset and model.

#### <mark style="color:green;">Evaluation and Checkpointing</mark>

* You have set <mark style="color:yellow;">**`evals_per_epoch`**</mark> to 2, which means evaluation will be performed twice per epoch. Depending on your dataset size and training duration, you may want to adjust this value.
* Similarly, <mark style="color:yellow;">**`saves_per_epoch`**</mark> is set to 1, meaning checkpoints will be saved once per epoch. Adjust this value based on your desired checkpoint frequency and available storage.

#### <mark style="color:green;">Debugging and Logging</mark>

* Setting <mark style="color:yellow;">**`debug`**</mark> to <mark style="color:yellow;">**`true`**</mark> will enable debug mode, which can provide more detailed logging and debugging information. However, keep in mind that this may slow down the training process.
* You have set <mark style="color:yellow;">**`logging_steps`**</mark> to 1, which means logging will occur at every step. If you find the logging too verbose, you can increase this value to a larger number to reduce the logging frequency.

Overall, your configuration looks reasonable for fine-tuning LLaMA-3-8B on an A100 with 80GB of memory.

Just be mindful of the memory usage, especially with the large sequence length and unquantized weights. If you encounter OOM issues, consider reducing the sequence length and keeping the micro batch size at 1. Increasing the batch size raises memory use, so only do that when you have headroom.

Remember to monitor the training progress, evaluate the model's performance, and make adjustments as needed based on your specific requirements and available resources.

