Augmentation Techniques

noisy_embedding_alpha

noisy_embedding_alpha applies noise to the input embeddings as a form of data augmentation. It is based on the NEFTune (Noisy Embedding Fine-Tuning) technique: set it to a number (e.g., 5) to add uniform noise, scaled by that value, to the embeddings during training. Introducing this variability can improve robustness and generalization.
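A minimal sketch of how NEFTune-style noise can be applied, assuming the usual scaling of alpha / sqrt(seq_len * hidden_dim); the function name and tensor shapes are illustrative, not Axolotl's internal code:

```python
import torch

def add_neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune-style uniform noise to a batch of token embeddings.

    embeddings: (batch, seq_len, hidden_dim) output of the embedding layer.
    alpha: noise scale; noise magnitude is alpha / sqrt(seq_len * hidden_dim).
    """
    seq_len, hidden_dim = embeddings.shape[1], embeddings.shape[2]
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-scale, scale)
    return embeddings + noise

# Example: noise the embeddings of 2 sequences of length 128 during training only.
emb = torch.randn(2, 128, 4096)
noisy = add_neftune_noise(emb, alpha=5.0)
```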

flash_optimum

flash_optimum determines whether to wrap the model with BetterTransformer from Hugging Face's Optimum library. BetterTransformer swaps supported modules for optimized "fastpath" implementations, which can improve performance.
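Assuming this flag maps to the BetterTransformer API in Hugging Face Optimum, a minimal sketch of the transformation it applies (the model name is only an example):

```python
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Swap supported modules for BetterTransformer fastpath implementations.
model = BetterTransformer.transform(model)

# ... train or run inference ...

# Convert back to the original Transformers modules (e.g., before saving).
model = BetterTransformer.reverse(model)
```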

xformers_attention

xformers_attention specifies whether to apply the attention patch from the xFormers library. xFormers provides optimized implementations of transformer components, including memory-efficient attention kernels.
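A minimal sketch of the xFormers memory-efficient attention call that such a patch typically routes to; the tensor shapes are illustrative, and it requires a CUDA GPU with the xformers package installed:

```python
import torch
import xformers.ops as xops

# (batch, seq_len, num_heads, head_dim) tensors on a CUDA device.
q = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)

# Memory-efficient attention; the lower-triangular bias keeps decoder-style causality.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```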

flash_attention

flash_attention controls whether to apply the FlashAttention patch from the flash-attn library. FlashAttention computes exact attention with a fused kernel that is faster and more memory-efficient than the naive implementation.
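A minimal sketch of the flash-attn kernel such a patch typically routes attention through; shapes are illustrative, and it requires a CUDA GPU with fp16 or bf16 tensors:

```python
import torch
from flash_attn import flash_attn_func

# FlashAttention expects (batch, seq_len, num_heads, head_dim) fp16/bf16 CUDA tensors.
q = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)

# Exact causal attention computed with the memory-efficient FlashAttention kernel.
out = flash_attn_func(q, k, v, causal=True)
```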

flash_attn_cross_entropy

flash_attn_cross_entropy determines whether to use the cross-entropy loss implementation from the flash-attn library instead of the standard PyTorch one. This is an advanced option intended for specific use cases and should be used with caution.
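A hedged sketch of using flash-attn's cross-entropy loss as a drop-in replacement for torch.nn.CrossEntropyLoss; the import path flash_attn.losses.cross_entropy is an assumption that may differ between flash-attn versions, and the call requires a CUDA GPU:

```python
import torch
from flash_attn.losses.cross_entropy import CrossEntropyLoss  # path may vary by version

loss_fn = CrossEntropyLoss(ignore_index=-100)  # mirrors torch.nn.CrossEntropyLoss

logits = torch.randn(8, 32000, device="cuda", dtype=torch.float16, requires_grad=True)
labels = torch.randint(0, 32000, (8,), device="cuda")

loss = loss_fn(logits, labels)
loss.backward()
```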

flash_attn_rms_norm

flash_attn_rms_norm specifies whether to use the fused Root Mean Square (RMS) Norm implementation from the flash-attn library. RMSNorm normalizes activations by their root mean square, rather than by mean and variance as LayerNorm does.
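For context, a plain PyTorch reference implementation of RMSNorm (this shows the math only, not the fused Flash-Attention kernel itself):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Reference RMSNorm: rescale activations by their root mean square."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)

x = torch.randn(2, 16, 4096)
print(RMSNorm(4096)(x).shape)  # torch.Size([2, 16, 4096])
```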

flash_attn_fuse_qkv

flash_attn_fuse_qkv controls whether to fuse the Query, Key, and Value (QKV) projections of the attention mechanism into a single operation, reducing the number of separate matrix multiplications and potentially improving training efficiency.
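A conceptual sketch of what QKV fusion means, replacing three separate projections with one; in practice the patch rewrites the model's existing weights rather than creating new layers, and the sizes here are illustrative:

```python
import torch
import torch.nn as nn

hidden_dim = 4096

# Unfused: three separate projections, three matrix multiplications per forward pass.
q_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
k_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
v_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)

# Fused: a single projection producing Q, K and V in one matmul, then split.
qkv_proj = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)

x = torch.randn(2, 128, hidden_dim)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
```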

flash_attn_fuse_mlp

flash_attn_fuse_mlp determines whether to fuse part of the Multi-Layer Perceptron (MLP) block of each transformer layer into a single operation. Like the previous option, this aims to improve efficiency by reducing the number of separate matrix multiplications.
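A conceptual sketch of partial MLP fusion in a LLaMA-style feed-forward block, merging the gate and up projections into one matrix multiplication; layer names and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, intermediate_dim = 4096, 11008

# A LLaMA-style MLP normally runs gate and up projections as two separate matmuls.
# Fusing them into one projection halves the number of matmul launches for that step.
gate_up_proj = nn.Linear(hidden_dim, 2 * intermediate_dim, bias=False)
down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)

x = torch.randn(2, 128, hidden_dim)
gate, up = gate_up_proj(x).chunk(2, dim=-1)
out = down_proj(F.silu(gate) * up)
```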

sdp_attention

sdp_attention specifies whether to use PyTorch's Scaled Dot-Product Attention (torch.nn.functional.scaled_dot_product_attention), a fused implementation of the fundamental attention operation in transformer models. See the PyTorch documentation for this function for details.
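A minimal example of the PyTorch scaled dot-product attention call this option routes attention through; shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim) is the layout expected by PyTorch SDPA.
q = torch.randn(1, 16, 1024, 64)
k = torch.randn(1, 16, 1024, 64)
v = torch.randn(1, 16, 1024, 64)

# PyTorch dispatches to an optimized backend (FlashAttention, memory-efficient,
# or the math fallback) depending on hardware, dtype, and inputs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```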

landmark_attention

landmark_attention is used only with LLaMA and controls whether to use landmark attention. Landmark attention inserts special landmark tokens that summarize blocks of the input, letting the model retrieve only the relevant blocks at attention time and thereby extend its usable context length.

xpos_rope

xpos_rope is specific to LLaMA and relates to RoPE (Rotary Position Embedding). It applies the xPos variant, which adds a position-dependent scaling to the rotary embeddings to improve length extrapolation. The linked resource points to an external reference implementation.
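A schematic sketch of the idea behind xPos on top of standard RoPE: rotary angles plus a position-dependent decay applied with opposite signs to queries and keys. The constants gamma and scale_base and the exact position handling are illustrative, not the canonical values from any particular implementation:

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE rotation angles: one frequency per pair of dimensions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

def xpos_scale(seq_len: int, head_dim: int,
               gamma: float = 0.4, scale_base: float = 512.0) -> torch.Tensor:
    """xPos-style per-dimension decay; queries use scale, keys use 1 / scale."""
    frac = torch.arange(0, head_dim, 2).float() / head_dim
    zeta = (frac + gamma) / (1.0 + gamma)          # per-dimension base, close to 1
    positions = torch.arange(seq_len).float()[:, None]
    return zeta[None, :] ** (positions / scale_base)  # (seq_len, head_dim // 2)

angles = rope_angles(seq_len=2048, head_dim=128)
scale = xpos_scale(seq_len=2048, head_dim=128)
# cos/sin of `angles` rotate query/key pairs as in standard RoPE; multiplying
# queries by `scale` and keys by `1 / scale` adds the xPos decay.
```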
