Popular Datasets
This is an analysis of the most popular datasets. You can skip this section if you want to move straight to downloading the dataset for your fine-tuning run.
Alpaca Cleaned
The cleaned Alpaca Dataset is a curated and cleaned version of the original dataset used to train the Alpaca Large Language Model (LLM).
The original dataset, generated using GPT-3, had several issues that impacted its quality and usefulness for training machine learning models. The cleaned version addresses these problems to improve the performance of models trained on this data.
Key points
The cleaned dataset fixes issues such as hallucinations, merged instructions, empty outputs, missing code examples, incorrect answers, and unclear instructions.
On April 8, 2023, the remaining uncurated instructions (around 50,000) were replaced with data from the GPT-4-LLM dataset. Curation of the new data is ongoing.
The average prompt length in the cleaned dataset is longer than the original, with many prompts exceeding 256 tokens. It is recommended to set the maximum prompt length to at least 512 tokens during fine-tuning (see the sketch after this list).
This cleaned dataset aims to provide a higher-quality resource for training and fine-tuning LLMs, ultimately leading to better-performing models with reduced hallucinations.
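To make the prompt-length point concrete, here is a minimal sketch, assuming the cleaned data is the yahma/alpaca-cleaned release on the Hugging Face Hub with its instruction/input/output columns; the GPT-2 tokenizer is used only as a stand-in, and in practice you would use your target model's tokenizer.

```python
# Minimal sketch: inspect prompt lengths in the cleaned Alpaca dataset and
# tokenize with a generous max_length, as recommended above.
# Assumptions: Hub id "yahma/alpaca-cleaned" and its instruction/input/output
# columns; "gpt2" is a placeholder tokenizer, swap in your model's own.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def prompt_length(example):
    prompt = example["instruction"] + "\n" + example["input"]
    return {"n_tokens": len(tokenizer(prompt)["input_ids"])}

with_lengths = dataset.map(prompt_length)
over_256 = sum(n > 256 for n in with_lengths["n_tokens"])
print(f"{over_256} of {len(dataset)} prompts exceed 256 tokens")

# Tokenize full examples with a max_length of at least 512 for fine-tuning.
tokenized = dataset.map(
    lambda ex: tokenizer(
        ex["instruction"] + "\n" + ex["input"] + "\n" + ex["output"],
        truncation=True,
        max_length=512,
    )
)
```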
Open Orca
The OpenOrca dataset is a collection of augmented FLAN Collection data, currently comprising roughly 1 million GPT-4 completions and 3.2 million GPT-3.5 completions.
It is tabularised in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
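If you want to poke around in it, a rough sketch follows. It assumes the dataset is published on the Hugging Face Hub as Open-Orca/OpenOrca and uses streaming, since the full collection runs to several million rows.

```python
# Rough sketch: stream a handful of OpenOrca examples without downloading
# the full multi-million-row dataset. Assumes the Hub id "Open-Orca/OpenOrca".
from itertools import islice

from datasets import load_dataset

stream = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

for example in islice(stream, 3):
    # Column names are taken from whatever the release actually provides.
    print({key: str(value)[:80] for key, value in example.items()})
```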
Financial Phrase Bank
The Financial Phrase Bank dataset is a collection of approximately 5,000 English sentences from financial news articles, annotated for sentiment analysis. The sentences are labeled as positive, negative, or neutral based on their potential impact on stock prices, from an investor's perspective.
Key features of the dataset:
Contains 4,840 sentences from English financial news
Sentences are labeled as 'positive', 'negative', or 'neutral'
Annotations were performed by 16 people with a background in finance
Available in four configurations based on annotator agreement percentages: 50%, 66%, 75%, and 100%
No predefined train/validation/test split (see the sketch after this list for creating one)
Sourced from a subset of 10,000 articles covering various companies, industries, and news sources
Annotators were from the same institution (Aalto University School of Business)
This dataset is useful for training and benchmarking sentiment analysis models in the financial domain.
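Because the dataset ships without a split, you carve one out yourself. The sketch below assumes it is available on the Hugging Face Hub as financial_phrasebank, with configuration names such as sentences_allagree matching the agreement levels above and a sentence/label schema; adjust the id and configuration if your copy differs.

```python
# Minimal sketch: load the Financial Phrase Bank at a chosen annotator-
# agreement level and carve out a held-out test split, since none is
# predefined. Assumes the Hub id "financial_phrasebank" and the
# "sentences_allagree" configuration (100% annotator agreement).
from datasets import load_dataset

data = load_dataset("financial_phrasebank", "sentences_allagree", split="train")

# Create an 80/20 train/test split with a fixed seed for reproducibility.
splits = data.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

print(train_ds[0])        # e.g. {"sentence": "...", "label": ...}
print(train_ds.features)  # label names as defined by the release
```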
Evol-Instruct
The training dataset for this fine-tuned model (WizardCoder) is initialised with Code Alpaca, a 20K instruction-following dataset.
The Evol-Instruct technique is then applied to evolve the dataset, with the evolved data being merged with the original dataset after each round.
Dataset Focus and Views
The dataset used for training WizardCoder plays a crucial role in its superior performance.
The authors start with Code Alpaca, a 20K instruction-following dataset, and iteratively evolve it using the Code Evol-Instruct method.
This process generates a more complex and diverse dataset tailored specifically for code-related tasks.
The adaptation of Evol-Instruct to the code domain involves several key modifications:
Refining evolutionary instructions by removing deepening, complicating input, and In-Breadth Evolving.
Simplifying the form of evolutionary prompts by unifying the prompt template.
Incorporating code-specific evolutionary instructions, such as code debugging and time-space complexity constraints.
These modifications demonstrate the importance of domain-specific adaptations when creating datasets for training LLMs.
By considering the unique characteristics and requirements of the code domain, the authors were able to generate a dataset that effectively enhances the performance of the Code LLM.
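The following is only an illustrative sketch of that evolution loop, not the authors' released code: the evolution prompts merely paraphrase the code-specific ideas above, and generate is a hypothetical stand-in for whatever LLM call you use.

```python
# Illustrative sketch of a Code Evol-Instruct style loop (not the authors'
# implementation). `generate` is a hypothetical placeholder for an LLM call;
# the prompts paraphrase the code-specific evolution ideas described above.
import random

EVOLUTION_PROMPTS = [
    "Rewrite the following programming task so it also requires handling edge cases:\n{instruction}",
    "Add a time or space complexity constraint to the following programming task:\n{instruction}",
    "Provide a buggy solution to the following task and ask for it to be debugged:\n{instruction}",
]

def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider or local model."""
    raise NotImplementedError

def evolve(dataset: list[dict], rounds: int = 3) -> list[dict]:
    evolved = list(dataset)   # keep the original seed instructions
    current = list(dataset)
    for _ in range(rounds):
        next_round = []
        for example in current:
            template = random.choice(EVOLUTION_PROMPTS)
            new_instruction = generate(template.format(instruction=example["instruction"]))
            new_output = generate(new_instruction)
            next_round.append({"instruction": new_instruction, "output": new_output})
        # Merge each round's evolved data back into the growing dataset.
        evolved.extend(next_round)
        current = next_round
    return evolved
```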
Creating Datasets for LLM training
When creating similar datasets for training LLMs, I believe the following points should be considered:
Start with a high-quality initial dataset relevant to the target domain.
Identify domain-specific characteristics and requirements that can be incorporated into the dataset evolution process.
Develop a set of evolutionary instructions and prompts that align with the domain-specific goals and constraints.
Iteratively evolve the dataset using the adapted evolutionary instructions, evaluating the performance of the model after each round of evolution.
Continuously refine the evolutionary process based on the observed performance and domain-specific insights gained during the training process.
By following these guidelines and adapting the Evol-Instruct method to the specific domain of interest, researchers can create high-quality datasets that enable the training of LLMs with superior performance in their respective domains.
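A skeleton of that iterative process might look like the following; every helper here (evolve_round, fine_tune, evaluate) is a hypothetical placeholder for your own pipeline, and the stopping rule is just one reasonable choice.

```python
# Skeleton of the evolve -> fine-tune -> evaluate loop described above.
# All three helpers are hypothetical placeholders for your own pipeline.

def evolve_round(dataset):
    """Apply domain-specific evolution prompts to produce new examples."""
    raise NotImplementedError

def fine_tune(base_model, dataset):
    """Fine-tune the base model on the current dataset; return the model."""
    raise NotImplementedError

def evaluate(model):
    """Score the model on a held-out, domain-specific benchmark."""
    raise NotImplementedError

def iterate(base_model, seed_dataset, max_rounds=5):
    dataset = list(seed_dataset)
    best_score, best_model = float("-inf"), None
    for round_idx in range(max_rounds):
        dataset = dataset + evolve_round(dataset)  # merge evolved data each round
        model = fine_tune(base_model, dataset)
        score = evaluate(model)
        print(f"round {round_idx}: score={score:.3f}")
        if score <= best_score:
            break  # stop once another round of evolution stops helping
        best_score, best_model = score, model
    return best_model, dataset
```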
The Persuasion Dataset
The Persuasion Dataset is structured to provide a comprehensive understanding of the effectiveness of various arguments, whether human-written or model-generated, in altering a person's stance on specific claims. Here is a structured table summarizing the dataset's components:
| Column Name | Description |
| --- | --- |
| worker_id | Identifier for the participant who annotated their stance. |
| claim | The statement or assertion presented for argument. |
| argument | The argument provided, crafted by either a human or a language model. |
| source | Indicates whether the argument was generated by a human or a specific model. |
| prompt_type | The type of prompt used to generate the argument. |
| rating_initial | Participant's initial rating of the claim before reading the argument. |
| rating_final | Participant's final rating of the claim after being exposed to the argument. |
| persuasiveness_metric | Numerical score indicating the persuasiveness of the argument (not shown in the initial explanation but deducible from context). |
Explanation of the Dataset Structure
Worker ID: This is crucial for tracking the responses from individual participants across multiple claims or arguments, ensuring the data is consistent and attributed correctly.
Claim: The focal point of the dataset; these are statements or topics on which arguments are based. Understanding the diversity and nature of claims is essential for analyzing how different types of arguments perform.
Argument: Central to the dataset, these entries show the persuasive text presented to participants. The content and quality of these arguments are key to the research being conducted.
Source: Distinguishing between human and model-generated arguments allows researchers to compare the effectiveness of natural versus artificial persuasive techniques.
Prompt Type: Knowing the prompt type helps in understanding the context or angle from which the argument was developed, which can influence its persuasiveness.
Ratings (Initial and Final): These metrics are critical as they provide a before-and-after snapshot of the participant's stance on a claim, showing the direct impact of the argument. The change between these ratings can be used as a quantitative measure of argument effectiveness.
Persuasiveness Metric: Whether taken directly from the release or derived, such a metric can be computed from the difference between the initial and final ratings, quantifying how persuasive an argument was (a small sketch of this calculation follows the next paragraph).
This dataset can be immensely useful for developing and refining language models aimed at persuasive writing. It also provides insights into human cognitive biases and response patterns to different forms of rhetoric, aiding in fields like marketing, political science, and psychology.
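As a small worked sketch of that before-and-after calculation, assume the dataset sits in a local CSV named persuasion.csv with the columns in the table above and ratings that are numeric or can be coerced to numbers; the file name and the coercion step are assumptions, not part of the official release.

```python
# Small sketch: compute a per-argument persuasion delta and compare sources.
# Assumptions: a local "persuasion.csv" with the columns listed above, and
# ratings that contain a leading numeric value (e.g. a 1-7 scale).
import pandas as pd

df = pd.read_csv("persuasion.csv")

# Coerce ratings to numbers in case they are stored as labelled strings.
for col in ("rating_initial", "rating_final"):
    df[col] = pd.to_numeric(df[col].astype(str).str.extract(r"(\d+)")[0])

# The shift in stance is a simple proxy for persuasiveness.
df["rating_delta"] = df["rating_final"] - df["rating_initial"]

# Compare human-written and model-generated arguments by average shift.
print(df.groupby("source")["rating_delta"].agg(["mean", "count"]))
```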