Full Fine Tune

This is the configuration file for a full fine tune.

base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files: 
      - datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet

dataset_prepared_path: ./prepared_data
val_set_size: 0.10
output_dir: ./llama-fft-out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project: llama3-fft
wandb_entity: continuum-labs
wandb_watch:
wandb_name: 
wandb_log_model:

gradient_accumulation_steps: 8  
micro_batch_size: 1  
num_epochs: 1  
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>

Suggestions

Optimizer and Learning Rate

  • The config uses the paged_adamw_8bit optimizer with a learning rate of 2e-5. You could experiment with other optimizers such as lion_8bit, galore_adamw_8bit, or adamw_torch_fused to see whether they perform better.

  • Additionally, you can try different learning rates, such as 1e-5 or 3e-5, to find the optimal value for your specific dataset and model.
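As a minimal sketch of the two suggestions above, the fragment below swaps the optimizer and lowers the learning rate. The values are illustrative assumptions drawn from the options already mentioned, not tuned recommendations.

```yaml
# Illustrative variation only: values are assumptions, not tuned results.
optimizer: adamw_torch_fused   # alternative to paged_adamw_8bit; uses more optimizer memory than the 8-bit variant
lr_scheduler: cosine
learning_rate: 1e-5            # compare 1e-5, 2e-5, and 3e-5 in separate runs
```

Changing one setting per run keeps the comparison against the 2e-5 baseline meaningful.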

Learning Rate Scheduler

  • You're using the cosine learning rate scheduler. You could explore other options like one_cycle or log_sweep to see if they improve the training process.

  • If using the cosine scheduler, you can set cosine_min_lr_ratio and cosine_constant_lr_ratio to control the decay and freezing of the learning rate during training.
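A minimal sketch of the cosine options named above. The numbers are illustrative, and the behavior described in the comments is an assumption that should be checked against the documentation for the trainer version in use.

```yaml
lr_scheduler: cosine
cosine_min_lr_ratio: 0.1        # assumed: decay to about 10% of the peak learning rate
cosine_constant_lr_ratio: 0.8   # assumed: hold the learning rate constant after about 80% of training
```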

Gradient Accumulation and Batch Size

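  • Adjust gradient_accumulation_steps and micro_batch_size based on your available GPU memory. A larger micro_batch_size processes more sequences per forward pass but needs more memory, while a larger gradient_accumulation_steps raises the effective batch size at little extra memory cost but makes each optimizer step take longer.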

  • You can also set eval_batch_size to a different value than micro_batch_size for evaluation.
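For illustration, the fragment below trades accumulation steps for a larger micro batch while keeping the same effective batch size. The values are assumptions and only fit if memory allows a micro batch of 2 at the configured sequence length.

```yaml
micro_batch_size: 2              # sequences per GPU per forward/backward pass
gradient_accumulation_steps: 4   # 2 x 4 = 8, the same effective batch as 1 x 8
eval_batch_size: 2               # evaluation stores no gradients, so it can often be larger
```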

Datasets and Preprocessing

  • Consider using multiple datasets by adding more entries to the datasets list. This can help improve the model's generalization and robustness.

  • Experiment with different sequence_len values to find the optimal sequence length for your specific task and dataset.

  • Set train_on_inputs to true if you want to include the human's prompt in the training labels. With false, as in this config, the loss is computed only on the response tokens.
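A sketch of a two-entry datasets list, reusing the entry from the config above. The second path is a hypothetical placeholder, not a real file, and it is assumed to use the same alpaca format.

```yaml
datasets:
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
  # Hypothetical second dataset: the path below is a placeholder.
  - path: datasets/my-second-dataset/train.parquet
    type: alpaca
    ds_type: parquet

train_on_inputs: false   # false: loss on responses only; true: prompts count toward the loss too
```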

Evaluation and Checkpointing

  • Adjust eval_steps or evals_per_epoch to control the frequency of evaluation during training. More frequent evaluation can provide better insights into the model's progress.

  • Similarly, modify save_steps or saves_per_epoch to control the frequency of checkpoint saving.

  • Set save_total_limit to limit the number of checkpoints saved at a time, preventing disk space issues.
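A sketch of step-based evaluation and checkpointing using the keys named above. The step counts are illustrative, and the fragment assumes the per-epoch keys are left unset when the step-based ones are used.

```yaml
# Step-based alternative to evals_per_epoch / saves_per_epoch.
# Assumption: leave the per-epoch keys unset when using these.
eval_steps: 50
save_steps: 100
save_total_limit: 3   # keep at most 3 checkpoints on disk
```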

Advanced Techniques

  • Set neftune_noise_alpha to add noise to the embeddings during training using the NEFTune technique, which can improve generalization.

  • Set s2_attention to true to use shifted-sparse attention for LLaMA models, which can reduce memory usage and improve efficiency.

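  • The lora_r, lora_alpha, and lora_dropout options apply only when training with a LoRA adapter. They are not used by the full fine tune configured here, but are worth experimenting with if you switch to LoRA.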
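Of the options above, NEFTune is the simplest to try on this full fine tune. The value below is an illustrative assumption, not a recommendation for this dataset.

```yaml
neftune_noise_alpha: 5   # illustrative value; leave unset to train without embedding noise
```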

Debugging and Monitoring

  • Set debug to true to enable debug mode for more detailed logging and debugging information.

  • Adjust logging_steps to control the frequency of logging during training.

  • Set loss_watchdog_threshold and loss_watchdog_patience to monitor and abort training if the loss exceeds a certain threshold for a specified number of steps.
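A sketch combining the monitoring options above. The threshold and patience are illustrative assumptions, and a sensible threshold depends on the loss scale of your own runs.

```yaml
debug: true
logging_steps: 10              # log every 10 steps instead of every step
loss_watchdog_threshold: 5.0   # illustrative: treat a loss above 5.0 as a sign of divergence
loss_watchdog_patience: 3      # illustrative: abort after 3 such steps
```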

Tokenizer and Special Tokens

  • Add or modify special tokens in the special_tokens section based on your specific requirements.

  • If you have added extra tokens to your tokenizer, specify them in the tokens list.
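For illustration, the fragment below keeps the pad token from the config above and adds a tokens list. The two token strings are hypothetical placeholders.

```yaml
special_tokens:
  pad_token: <|end_of_text|>
# Hypothetical extra tokens; replace with the tokens your data actually uses.
tokens:
  - "<|tool_call|>"
  - "<|tool_result|>"
```

Newly added tokens start with untrained embeddings, so they only become useful if the training data contains them.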

Remember to experiment and iterate on these configurations to find the optimal settings for your specific use case. It's also important to monitor the training progress, evaluate the model's performance, and make adjustments accordingly.

Machine Memory

Memory Usage

  • With load_in_8bit and load_in_4bit set to false, the model is loaded without quantization. With bf16: auto, the weights are typically held in bfloat16 on GPUs that support it, at about 2 bytes per parameter. This still consumes a significant amount of memory for a large model like LLaMA-3-8B, because gradients and optimizer state are stored for every parameter as well.

  • Setting load_in_8bit to true would shrink the loaded weights, but quantized weights are not trained directly. In practice, 8-bit loading is paired with an adapter such as LoRA, so the run would no longer be a full fine tune.

Sequence Length

  • You have set sequence_len to 8192, which is quite large. Depending on your specific task and dataset, you may not need such a long sequence length.

  • Increasing the sequence length will require more memory during training. If you encounter out-of-memory (OOM) issues, consider reducing the sequence_len to a smaller value, such as 2048 or 4096.
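A minimal sketch of the fallback described above, halving the sequence length while leaving packing as configured. The value 4096 is one of the two examples mentioned, and whether it suits your data is an assumption to verify.

```yaml
sequence_len: 4096          # halved from 8192 to cut activation memory
sample_packing: true
pad_to_sequence_len: true
```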

Gradient Accumulation and Batch Size

  • With gradient_accumulation_steps set to 8 and micro_batch_size set to 1, the effective batch size is 8 sequences per optimizer step on a single GPU. With multiple GPUs, multiply by the number of GPUs.

  • Depending on your available memory, you may be able to increase the micro_batch_size to a larger value, such as 2 or 4, while keeping the gradient_accumulation_steps the same. This can help speed up the training process.
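The arithmetic behind the effective batch size can be written out as follows, assuming a single GPU (the symbol N_GPU is introduced here for illustration) and packed sequences of at most sequence_len tokens:

```latex
\begin{align*}
\text{effective batch} &= \texttt{micro\_batch\_size} \times \texttt{gradient\_accumulation\_steps} \times N_{\text{GPU}} \\
&= 1 \times 8 \times 1 = 8 \ \text{sequences per optimizer step} \\
\text{tokens per step} &\le 8 \times 8192 = 65{,}536
\end{align*}
```

Raising micro_batch_size to 2 while keeping 8 accumulation steps would double both figures.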

Optimizer and Learning Rate

  • You're using the paged_adamw_8bit optimizer, which is a good choice for memory efficiency. It is provided by the bitsandbytes library, so make sure that dependency is installed.

  • The learning rate of 2e-5 is a reasonable starting point, but you may need to experiment with different values to find the optimal learning rate for your specific dataset and model.

Evaluation and Checkpointing

  • You have set evals_per_epoch to 2, which means evaluation will be performed twice per epoch. Depending on your dataset size and training duration, you may want to adjust this value.

  • Similarly, saves_per_epoch is set to 1, meaning checkpoints will be saved once per epoch. Adjust this value based on your desired checkpoint frequency and available storage.

Debugging and Logging

  • Setting debug to true will enable debug mode, which can provide more detailed logging and debugging information. However, keep in mind that this may slow down the training process.

  • You have set logging_steps to 1, which means logging will occur at every step. If you find the logging too verbose, you can increase this value to a larger number to reduce the logging frequency.

Overall, your configuration looks reasonable for fine-tuning LLaMA-3-8B on an A100 with 80GB of memory.

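Just be mindful of the memory usage, especially with the large sequence length and unquantized weights. If you encounter OOM issues, consider reducing the sequence length and keeping micro_batch_size at 1. Increasing the batch size raises memory use, so do that only when you have headroom. If a full fine tune is not essential, quantized loading with an adapter is another option.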

Remember to monitor the training progress, evaluate the model's performance, and make adjustments as needed based on your specific requirements and available resources.
