Phi 2.0 - Extra Hyperparameters

This is the remainder of the configuration file for training Phi 2.0. We provide a full explanation of each of these configuration options below.

warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
resize_token_embeddings_to_32x: true
special_tokens:
  pad_token: "<|endoftext|>"

Training Hyperparameters

warmup_steps: 100

Warm-up steps are a key part of learning rate scheduling.

Over the first 100 training steps, the learning rate increases gradually from near zero to its target value. This ramp-up stabilises early training, preventing the model from making overly large updates before the optimiser has gathered reliable gradient statistics.
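
As a rough sketch of how this interacts with the rest of the schedule, the excerpt below is illustrative; the learning_rate and lr_scheduler values are examples, not necessarily the ones used for Phi 2.0:

learning_rate: 0.0002   # target peak learning rate
lr_scheduler: cosine    # schedule that takes over once the warm-up ends
warmup_steps: 100       # the LR ramps from ~0 up to 0.0002 over the first 100 steps,
                        # then follows the cosine decay for the rest of training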

evals_per_epoch: 4

This setting determines the frequency of evaluations within each training epoch.

With a value of 4, the model will be evaluated four times per epoch, providing regular feedback on its performance. Frequent evaluations help in monitoring the model's progress and ensuring it is learning as expected.

saves_per_epoch: 1

To safeguard your training progress, the model's state is saved once per epoch. This checkpointing lets you resume training from the last saved state after an interruption, and the intermediate checkpoints can be evaluated or reused later.
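
To make the cadence concrete, here is a small, purely illustrative example; the number of steps per epoch is invented:

# Hypothetical example: suppose one epoch corresponds to 1,000 optimiser steps.
evals_per_epoch: 4   # evaluation then runs roughly every 250 steps
saves_per_epoch: 1   # one checkpoint is written per epoch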

DeepSpeed and FSDP

deepspeed:

DeepSpeed integration offers advanced optimisations, such as ZeRO partitioning of optimiser states, gradients and parameters, for accelerating training and reducing memory consumption. Pointing this key at a DeepSpeed configuration file improves training efficiency, particularly for large-scale models, by distributing the workload and memory across your GPUs. Left blank, as it is here, DeepSpeed is not used.
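
As a hedged sketch, the key usually points at a DeepSpeed JSON file. The path below is only an example; it assumes you have such a file available (the Axolotl repository ships ready-made ZeRO configs you can point to):

deepspeed: deepspeed_configs/zero2.json   # example path to a ZeRO stage 2 JSON config;
                                          # leave the value empty to train without DeepSpeed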

fsdp_config:

Fully Sharded Data Parallel (FSDP) reduces memory consumption and increases the scale of distributed training by sharding model parameters, gradients and optimiser states across GPUs. The fsdp key (listed above) switches the strategy on, while fsdp_config lets you customise FSDP's behaviour, optimising memory usage and computational efficiency across multiple devices. Both are left blank in this configuration, so FSDP is disabled.
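
For illustration only, a typical pair of entries might look like the sketch below. The specific options are common Hugging Face/Axolotl FSDP settings rather than values taken from this Phi 2.0 config, and the layer class name is an assumption that depends on the model implementation:

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true                             # offload sharded parameters to CPU to save VRAM
  fsdp_state_dict_type: FULL_STATE_DICT                 # write a single consolidated checkpoint
  fsdp_transformer_layer_cls_to_wrap: PhiDecoderLayer   # assumed class name for Phi 2.0's transformer blocks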

weight_decay: 0.1

Weight decay is a regularization technique to prevent overfitting by penalising large weights.

A weight decay factor of 0.1 helps in moderating the update of weights, encouraging the model to learn more general features rather than overly fitting to the training data.
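
As a simplified sketch of the mechanics, assuming an AdamW-style optimiser with decoupled weight decay, each step shrinks every weight slightly in addition to applying the gradient update:

$$\theta_{t+1} = \theta_t - \eta\,(\hat{g}_t + \lambda\,\theta_t), \qquad \lambda = 0.1$$

where $\hat{g}_t$ is the optimiser's gradient-based update direction, $\eta$ is the learning rate and $\lambda$ is the weight_decay value.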

Special Tokens and Token Embeddings

resize_token_embeddings_to_32x:

This option pads the model's token embedding matrix up to the next multiple of 32 rather than matching the tokenizer's vocabulary size exactly.

Dimensions that are multiples of 32 are generally friendlier to GPU tensor cores, which can improve training throughput, and the spare rows make it easy to add a few extra special tokens without another resize. For example, a vocabulary of 50,257 tokens would be padded to 50,272 (the next multiple of 32), with the unused embedding rows simply left in place.

pad_token

Special tokens play a pivotal role in how the model processes and understands text. The padding token (pad_token) is used to fill out sequences to a uniform length, ensuring a consistent input size for the model. Here it is set to <|endoftext|>, Phi 2.0's end-of-text token, a common choice when a tokenizer does not define a dedicated padding token; the padded positions are masked out during training so they do not influence the loss.
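
The same special_tokens block can also override other tokens. As an illustrative sketch, Phi 2.0's tokenizer conventionally reuses <|endoftext|> for these roles, but treat the exact values as an example rather than a prescription:

special_tokens:
  bos_token: "<|endoftext|>"
  eos_token: "<|endoftext|>"
  pad_token: "<|endoftext|>"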
