Full Fine Tune
This is the configuration file for a full fine-tune.
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files:
      - datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
dataset_prepared_path: ./prepared_data
val_set_size: 0.10
output_dir: ./llama-fft-out
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
wandb_project: llama3-fft
wandb_entity: continuum-labs
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>

Suggestions
Optimizer and Learning Rate
- You're currently using the `paged_adamw_8bit` optimizer with a learning rate of 2e-5. You could experiment with other optimizers like `lion_8bit`, `galore_adamw_8bit`, or `adamw_torch_fused` to see if they yield better performance.
- Additionally, you can try different learning rates, such as 1e-5 or 3e-5, to find the optimal value for your specific dataset and model.
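For example, swapping in one of those alternatives only requires changing two keys. The values shown are illustrative, not tuned recommendations:

```yaml
# Illustrative alternative: fused AdamW with a slightly lower learning rate
optimizer: adamw_torch_fused
learning_rate: 1e-5
```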
Learning Rate Scheduler
- You're using the `cosine` learning rate scheduler. You could explore other options like `one_cycle` or `log_sweep` to see if they improve the training process.
- If using the cosine scheduler, you can set `cosine_min_lr_ratio` and `cosine_constant_lr_ratio` to control the decay and freezing of the learning rate during training.
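As a rough sketch, the cosine-specific knobs could look like this. The ratios are illustrative; check that they suit your run length:

```yaml
# Illustrative cosine settings: decay the LR toward a fraction of the peak value,
# then hold it steady for the tail end of training
lr_scheduler: cosine
cosine_min_lr_ratio: 0.1
cosine_constant_lr_ratio: 0.8
```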
Gradient Accumulation and Batch Size
- Adjust the `gradient_accumulation_steps` and `micro_batch_size` based on your available GPU memory. Increasing the batch size can lead to faster convergence but may require more memory.
- You can also set `eval_batch_size` to a different value than `micro_batch_size` for evaluation.
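For reference, the effective batch size is micro_batch_size × gradient_accumulation_steps × number of GPUs. A sketch that keeps the same effective batch size of 8 on a single GPU while halving the accumulation steps (values are illustrative):

```yaml
# Illustrative: 2 × 4 × 1 GPU = effective batch size of 8
micro_batch_size: 2
gradient_accumulation_steps: 4
eval_batch_size: 2
```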
Datasets and Preprocessing
- Consider using multiple datasets by adding more entries to the `datasets` list. This can help improve the model's generalization and robustness.
- Experiment with different `sequence_len` values to find the optimal sequence length for your specific task and dataset.
- Set `train_on_inputs` to `true` if you want to include the human's prompt in the training labels.
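A minimal sketch of a multi-dataset setup; the second path is hypothetical and only illustrates how to add another entry to the list:

```yaml
datasets:
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
  # hypothetical second dataset, shown only to illustrate the list syntax
  - path: datasets/another_instruct_set/train.parquet
    type: alpaca
    ds_type: parquet
```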
Evaluation and Checkpointing
- Adjust `eval_steps` or `evals_per_epoch` to control the frequency of evaluation during training. More frequent evaluation can provide better insights into the model's progress.
- Similarly, modify `save_steps` or `saves_per_epoch` to control the frequency of checkpoint saving.
- Set `save_total_limit` to limit the number of checkpoints kept at a time, preventing disk space issues.
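For example, a sketch that evaluates and saves more often while capping how many checkpoints stay on disk (values are illustrative):

```yaml
evals_per_epoch: 4      # evaluate four times per epoch
saves_per_epoch: 2      # save two checkpoints per epoch
save_total_limit: 2     # keep only the two most recent checkpoints
```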
Advanced Techniques
- Enable `neftune_noise_alpha` to add noise to embeddings using the NEFTune technique, which can improve generalization.
- Set `s2_attention` to `true` to use shifted-sparse attention for LLaMA models, which can reduce memory usage and improve efficiency.
- Experiment with different values for `lora_r`, `lora_alpha`, and `lora_dropout` if using LoRA adaptation.
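A sketch of the first two options; the NEFTune alpha of 5 is a common starting point rather than a tuned value, and you should verify `s2_attention` support in your Axolotl version:

```yaml
neftune_noise_alpha: 5   # add noise to embeddings during training (NEFTune)
s2_attention: true       # shifted-sparse attention for LLaMA-family models
```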
Debugging and Monitoring
- Set `debug` to `true` to enable debug mode for more detailed logging and debugging information.
- Adjust `logging_steps` to control the frequency of logging during training.
- Set `loss_watchdog_threshold` and `loss_watchdog_patience` to monitor and abort training if the loss exceeds a certain threshold for a specified number of steps.
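For example, a loss-watchdog sketch; the threshold and patience values are illustrative and should be tuned to your typical loss range:

```yaml
loss_watchdog_threshold: 5.0   # abort if the loss climbs above this value...
loss_watchdog_patience: 3      # ...for this many consecutive logged steps
logging_steps: 10              # log less often than every step
```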
Tokenizer and Special Tokens
- Add or modify special tokens in the `special_tokens` section based on your specific requirements.
- If you have added extra tokens to your tokenizer, specify them in the `tokens` list.
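A sketch of both options; the extra tokens shown are hypothetical placeholders, not tokens your dataset necessarily needs:

```yaml
special_tokens:
  pad_token: <|end_of_text|>
tokens:                  # hypothetical extra tokens added to the tokenizer
  - "<|im_start|>"
  - "<|im_end|>"
```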
Remember to experiment and iterate on these configurations to find the optimal settings for your specific use case. It's also important to monitor the training progress, evaluate the model's performance, and make adjustments accordingly.
Machine Memory
Memory Usage
- With `load_in_8bit` and `load_in_4bit` set to `false`, the model weights are loaded at full precision rather than quantized. This may consume a significant amount of memory, especially for a large model like LLaMA-3-8B.
- To reduce memory usage, you can consider setting `load_in_8bit` to `true`. This loads the model in 8-bit precision, which can significantly reduce memory consumption while maintaining comparable performance.
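A minimal sketch of 8-bit loading; note that quantized loading is more commonly paired with LoRA/QLoRA than with a full fine-tune, so verify it fits your training setup before relying on it:

```yaml
load_in_8bit: true    # quantize the base weights to 8-bit at load time
load_in_4bit: false
```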
Sequence Length
- You have set `sequence_len` to 8192, which is quite large. Depending on your specific task and dataset, you may not need such a long sequence length.
- Increasing the sequence length requires more memory during training. If you encounter out-of-memory (OOM) issues, consider reducing `sequence_len` to a smaller value, such as 2048 or 4096.
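For example, halving the context window while keeping sample packing enabled (the value is illustrative; pick a length that covers most of your examples):

```yaml
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
```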
Gradient Accumulation and Batch Size
- With `gradient_accumulation_steps` set to 8 and `micro_batch_size` set to 1, the effective batch size will be 8.
- Depending on your available memory, you may be able to increase `micro_batch_size` to a larger value, such as 2 or 4, while keeping `gradient_accumulation_steps` the same. This can help speed up the training process.
Optimizer and Learning Rate
- You're using the `paged_adamw_8bit` optimizer, which is a good choice for memory efficiency. However, make sure you have the necessary dependencies (such as bitsandbytes) installed for this optimizer.
- The learning rate of 2e-5 is a reasonable starting point, but you may need to experiment with different values to find the optimal learning rate for your specific dataset and model.
Evaluation and Checkpointing
- You have set `evals_per_epoch` to 2, which means evaluation will be performed twice per epoch. Depending on your dataset size and training duration, you may want to adjust this value.
- Similarly, `saves_per_epoch` is set to 1, meaning checkpoints will be saved once per epoch. Adjust this value based on your desired checkpoint frequency and available storage.
Debugging and Logging
- Setting `debug` to `true` will enable debug mode, which can provide more detailed logging and debugging information. However, keep in mind that this may slow down the training process.
- You have set `logging_steps` to 1, which means logging will occur at every step. If you find the logging too verbose, you can increase this value to reduce the logging frequency.
Overall, your configuration looks reasonable for fine-tuning LLaMA-3-8B on an A100 with 80 GB of memory.
Just be mindful of the memory usage, especially with the large sequence length and unquantized model loading. If you encounter OOM issues, consider reducing the sequence length, enabling 8-bit loading, or lowering the micro-batch size.
Remember to monitor the training progress, evaluate the model's performance, and make adjustments as needed based on your specific requirements and available resources.