
This documentation is for the Axolotl community


JSON (JavaScript Object Notation)

JSON is a lightweight, human-readable, and widely used data interchange format.

It represents data as key-value pairs, making it suitable for storing structured or semi-structured data.

JSON is language-independent and supported by various programming languages.

Common use cases include:

  • Storing and exchanging data between a server and a client in web applications.

  • Configuration files for applications or services.

  • Storing data with hierarchical relationships or complex structures.

JSON is useful for training LLMs when you need to store and process structured or semi-structured textual data with additional metadata or attributes.

For instance, if you are working with a dataset containing articles with author, title, date, and content information, JSON can efficiently store and represent this data.
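As a minimal sketch of this idea (the field values below are invented for illustration), an article record with author, title, date, and content can be serialised to JSON and parsed back using Python's standard json module:

```python
import json

# Hypothetical article records with metadata, as described above.
articles = [
    {
        "author": "Jane Doe",
        "title": "Fine-Tuning Basics",
        "date": "2024-01-15",
        "content": "An introduction to fine-tuning language models...",
    }
]

# Serialise to a JSON string; indent makes it human-readable.
serialised = json.dumps(articles, indent=4)

# Parse it back into Python objects.
parsed = json.loads(serialised)
print(parsed[0]["title"])  # Fine-Tuning Basics
```

Because the structure is nested key-value pairs, the metadata travels with the content and survives a round trip through serialisation unchanged.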

JSON and JSON Lines (JSONL) are both formats used for storing and exchanging data. However, they have distinct structures and use cases.

JSON (JavaScript Object Notation)

  • Structure: JSON is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language. JSON represents data as a single, cohesive entity.

  • Usage: Typically used to represent complex data structures containing multiple nested objects and arrays. Ideal for configurations, RESTful API responses, and data interchange between server and web applications.

Format Example

[
    {
        "name": "John",
        "age": 30
    },
    {
        "name": "Jane",
        "age": 25
    }
]
  • File Read/Write: The entire JSON file must be read or written in one operation, which can be resource-intensive for large datasets.
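The whole-file cost can be seen directly with Python's json module: json.load must parse the complete document before any record is available. A small sketch, using a temporary file with the example data above:

```python
import json
import os
import tempfile

# Write the example JSON array to a temporary file.
records = [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]
path = os.path.join(tempfile.gettempdir(), "example.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# Reading back requires parsing the entire file in one operation;
# the whole array is held in memory at once.
with open(path, "r", encoding="utf-8") as f:
    data = json.load(f)

print(len(data))  # 2
```

For a two-record file this is trivial, but for a multi-gigabyte dataset the entire structure must fit in memory before the first record can be used.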

JSON Lines (JSONL)

  • Structure: JSON Lines is a convenient format for storing structured data that may be processed one record at a time. Each line is a discrete, self-contained JSON object, separated from the next by a newline character.

  • Usage: Well-suited for log files and data streaming scenarios where each line can be processed independently. Ideal for large datasets and Unix-style text processing tools.

Format Example

{"name": "John", "age": 30}
{"name": "Jane", "age": 25}
  • File Read/Write: Allows for processing data one line at a time, which is more efficient for large datasets. Each line is a valid JSON object, allowing for incremental reading/writing.
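The incremental pattern looks like this in Python (io.StringIO stands in for an open JSONL file, since both are iterated line by line):

```python
import io
import json

# Simulate a JSONL file; a real open file object iterates the same way.
jsonl = io.StringIO('{"name": "John", "age": 30}\n'
                    '{"name": "Jane", "age": 25}\n')

ages = []
for line in jsonl:               # one record at a time; memory use stays flat
    record = json.loads(line)    # each line is a complete JSON object
    ages.append(record["age"])

print(ages)  # [30, 25]
```

Only one record is ever in memory at a time, which is why JSONL scales to datasets far larger than available RAM.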

Key Differences

  1. Data Representation: JSON represents a single cohesive data structure, while JSONL represents multiple, independent JSON objects separated by newlines.

  2. Read/Write Efficiency: JSONL is more efficient for large datasets as it allows for processing one record at a time, whereas JSON requires handling the entire data structure at once.

  3. Use Cases: JSON is ideal for configurations and API responses, while JSONL is better for logging and streaming large datasets.

Commonalities

  • Both formats use UTF-8 encoding.

  • Both are based on the JSON standard and can be easily parsed using standard JSON parsers, with slight adaptations for JSONL.

JSONL is particularly advantageous when dealing with large datasets that need to be streamed or processed incrementally, avoiding the memory overhead of loading the entire dataset as required in standard JSON.
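When a dataset arrives as a single JSON array but needs to be streamed, converting it to JSONL is a one-liner per record: serialise each object independently and join with newlines. A small sketch using the example data from above:

```python
import json

records = [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]

# Convert a JSON array to JSONL text: one compact JSON object per line.
jsonl_text = "\n".join(json.dumps(r) for r in records)

print(jsonl_text)
# {"name": "John", "age": 30}
# {"name": "Jane", "age": 25}
```

Because each line is written independently, the same loop can emit records to disk as they are produced, without ever materialising the full array.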


Last updated 1 year ago
