Augmentation Techniques
noisy_embedding_alpha
noisy_embedding_alpha applies noise to token embeddings as a form of data augmentation. It is based on the NEFT (Noisy Embedding Fine-Tuning) technique and is set to a number (e.g., 5) that controls the magnitude of the noise. Injecting this noise adds variability to training, potentially improving robustness and generalization.
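The sketch below shows how NEFT-style noise can be injected into input embeddings, assuming PyTorch and the commonly cited scaling of alpha / sqrt(sequence_length * hidden_dim); the actual implementation hooks into the model's embedding layer rather than exposing a helper like this.

```python
import torch

def add_neft_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise to input embeddings, NEFT-style (sketch, not the library code).

    embeddings: (batch, seq_len, hidden_dim) output of the embedding layer.
    alpha: noise magnitude, e.g. noisy_embedding_alpha = 5.
    """
    seq_len, hidden_dim = embeddings.shape[1], embeddings.shape[2]
    # Scale uniform noise in [-1, 1] by alpha / sqrt(L * d).
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise
```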
flash_optimum
flash_optimum determines whether to convert the model with BetterTransformer, provided through the Hugging Face Optimum library. BetterTransformer swaps supported transformer modules for optimized fastpath implementations to improve performance.
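For reference, this is roughly what enabling the BetterTransformer path looks like when done by hand with the Optimum library; the checkpoint name is only an example.

```python
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

# Load a supported model (example checkpoint; substitute your own).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Swap supported modules for BetterTransformer fastpath implementations.
model = BetterTransformer.transform(model)
```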
xformers_attention
xformers_attention
specifies whether to use the attention patch from the XFormers library. XFormers is a library that provides optimized implementations of transformer components, including attention mechanisms.
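Under the hood the patch routes attention through xFormers' memory-efficient kernel; a standalone sketch, assuming tensors shaped (batch, seq_len, num_heads, head_dim) on a CUDA device:

```python
import torch
import xformers.ops as xops

# Dummy query/key/value tensors: (batch, seq_len, num_heads, head_dim).
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

# Memory-efficient attention; the lower-triangular bias gives causal masking.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```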
flash_attention
flash_attention
controls whether to use the Flash Attention patch from the Flash Attention library. Flash Attention is another library that offers optimized attention mechanisms for transformers.
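A direct call into the Flash Attention library looks like the sketch below; it requires a CUDA GPU and half-precision tensors shaped (batch, seq_len, num_heads, head_dim).

```python
import torch
from flash_attn import flash_attn_func

# Flash Attention kernels require fp16/bf16 tensors on a CUDA device.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)

# causal=True applies the lower-triangular mask used by decoder-only models.
out = flash_attn_func(q, k, v, causal=True)
```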
flash_attn_cross_entropy
flash_attn_cross_entropy
determines whether to use the Flash-Attention Cross Entropy implementation. This is an advanced option and should be used with caution, as it may require specific use cases.
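Assuming the flash_attn.losses.cross_entropy module path (verify it against your installed flash-attn version), the fused loss is used as a drop-in replacement for torch.nn.CrossEntropyLoss:

```python
import torch
from flash_attn.losses.cross_entropy import CrossEntropyLoss  # module path is an assumption; check your flash-attn version

# Drop-in replacement for torch.nn.CrossEntropyLoss over flattened logits.
loss_fn = CrossEntropyLoss(ignore_index=-100)

logits = torch.randn(4 * 128, 32000, device="cuda", dtype=torch.float16, requires_grad=True)
labels = torch.randint(0, 32000, (4 * 128,), device="cuda")
loss = loss_fn(logits, labels)
```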
flash_attn_rms_norm
flash_attn_rms_norm
specifies whether to use the Flash-Attention Root Mean Square (RMS) Norm implementation. RMS Norm is a technique for normalizing model activations.
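RMSNorm itself is simple; the plain PyTorch reference below shows the computation, while the Flash Attention version provides a fused kernel for the same operation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Reference RMSNorm: scale activations by their root mean square."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, there is no mean subtraction and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```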
flash_attn_fuse_qkv
flash_attn_fuse_qkv
controls whether to fuse the Query, Key, and Value (QKV) components of the attention mechanism into a single operation. This can potentially improve efficiency during training.
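Conceptually, QKV fusion replaces three separate projections with one larger matrix multiply whose output is split afterwards; a sketch with hypothetical dimensions:

```python
import torch
import torch.nn as nn

hidden_dim = 4096  # hypothetical model width

# Unfused: three separate projections, three kernel launches.
q_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
k_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
v_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)

# Fused: one projection producing Q, K and V in a single matmul.
qkv_proj = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)

x = torch.randn(1, 128, hidden_dim)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
```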
flash_attn_fuse_mlp
flash_attn_fuse_mlp
determines whether to fuse part of the Multi-Layer Perceptron (MLP) components of the attention mechanism into a single operation. Like the previous option, this aims to enhance efficiency.
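The same idea applied to a LLaMA-style gated MLP: the gate and up projections can be computed in one fused matmul instead of two. This is a conceptual sketch with hypothetical sizes, not the library's fused kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, intermediate_dim = 4096, 11008  # hypothetical LLaMA-like sizes

# Gate and up projections fused into a single matmul, then split.
gate_up_proj = nn.Linear(hidden_dim, 2 * intermediate_dim, bias=False)
down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)

x = torch.randn(1, 128, hidden_dim)
gate, up = gate_up_proj(x).chunk(2, dim=-1)
out = down_proj(F.silu(gate) * up)
```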
sdp_attention
sdp_attention
specifies whether to use the Scaled Dot-Product Attention mechanism, which is a fundamental component of transformer models. The link provided points to the PyTorch documentation for this attention mechanism.
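The option maps onto PyTorch's built-in kernel, which expects (batch, num_heads, seq_len, head_dim) tensors and dispatches to a fused backend when one is available:

```python
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim) — note the heads-first layout.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# PyTorch picks a backend (Flash, memory-efficient, or math) automatically.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```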
landmark_attention
landmark_attention
is used only with LLaMA and controls whether to use landmark attention. Landmark attention is a specialized attention mechanism designed for specific use cases.
xpos_rope
xpos_rope
is related to the RoPE (Relative Positional Encoding) technique and is specific to LLaMA. It appears to be related to modifying RoPE for positional encoding in the LLaMA model. The provided link points to an external resource for more details.