
Popular Datasets

This is an analysis of the most popular datasets. You can skip this section if you want to move forward to downloading the dataset for your fine-tuning run.

Alpaca Cleaned

The cleaned Alpaca Dataset is a curated and cleaned version of the original dataset used to train the Alpaca Large Language Model (LLM).

The original dataset, generated using GPT-3, had several issues that impacted its quality and usefulness for training machine learning models. The cleaned version addresses these problems to improve the performance of models trained on this data.

Alpaca Cleaned Dataset

Key points

  1. The cleaned dataset fixes issues such as hallucinations, merged instructions, empty outputs, missing code examples, incorrect answers, and unclear instructions.

  2. On April 8, 2023, the remaining uncurated instructions (around 50,000) were replaced with data from the GPT-4-LLM dataset. Curation of the new data is ongoing.

  3. The average prompt length in the cleaned dataset is longer than the original, with many prompts exceeding 256 tokens. It is recommended to set the maximum prompt length to at least 512 or higher during fine-tuning.

This cleaned dataset aims to provide a higher-quality resource for training and fine-tuning LLMs, ultimately leading to better-performing models with reduced hallucinations.
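
Point 3 above has a practical consequence for configuration: if the maximum sequence length is set too low, longer prompts get truncated. Below is a minimal sketch for checking the token-length distribution yourself, assuming the cleaned dataset is published on the Hugging Face Hub as yahma/alpaca-cleaned and using the GPT-2 tokenizer purely as a convenient, freely available proxy (your target model's tokenizer will give somewhat different counts).

```python
# A minimal sketch: measure prompt lengths in the cleaned Alpaca data before
# choosing a maximum sequence length. The dataset id "yahma/alpaca-cleaned"
# and the use of the GPT-2 tokenizer as a proxy are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def token_length(example):
    # Concatenate the fields that end up in the prompt/response pair.
    text = example["instruction"] + "\n" + example["input"] + "\n" + example["output"]
    return {"n_tokens": len(tokenizer(text).input_ids)}

lengths = dataset.map(token_length)["n_tokens"]
over_256 = sum(n > 256 for n in lengths)
print(f"examples: {len(lengths)}, over 256 tokens: {over_256}")
```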

Open Orca

The OpenOrca dataset is a collection of augmented FLAN Collection data, currently comprising roughly 1M GPT-4 completions and 3.2M GPT-3.5 completions.

It is tabularised in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.

Open Orca Dataset
Orca Paper
Flan Collection
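
If you want to inspect a few records before committing to the full download, a sketch like the one below works. It assumes the dataset is hosted on the Hugging Face Hub as Open-Orca/OpenOrca; streaming avoids pulling the multi-million-row corpus up front.

```python
# A minimal sketch of streaming a few OpenOrca records for inspection.
# The dataset id "Open-Orca/OpenOrca" is an assumption to verify on the Hub.
from datasets import load_dataset

stream = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
for i, example in enumerate(stream):
    # Truncate long fields so each record prints on a few lines.
    print({key: str(value)[:80] for key, value in example.items()})
    if i >= 2:
        break
```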

Financial Phrase Bank

The Financial Phrase Bank dataset is a collection of approximately 5,000 English sentences from financial news articles, annotated for sentiment analysis. The sentences are labeled as positive, negative, or neutral based on their potential impact on stock prices, from an investor's perspective.

Financial Phrase Bank dataset

Key features of the dataset:

  • Contains 4,840 sentences from English financial news

  • Sentences are labeled as 'positive', 'negative', or 'neutral'

  • Annotations were performed by 16 people with financial background

  • Available in four configurations based on annotator agreement percentages: 50%, 66%, 75%, and 100%

  • No predefined train/validation/test split

  • Sourced from a subset of 10,000 articles covering various companies, industries, and news sources

  • Annotators were from the same institution (Aalto University School of Business)

This dataset is useful for training and benchmarking sentiment analysis models in the financial domain.
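
A minimal loading sketch is shown below. It assumes the dataset is available on the Hugging Face Hub as financial_phrasebank with one configuration per agreement threshold (sentences_50agree, sentences_66agree, sentences_75agree, sentences_allagree); check the dataset card if the identifiers differ, and note that some versions of the datasets library require trust_remote_code=True for script-based datasets.

```python
# A minimal sketch of loading the Financial Phrase Bank at a chosen
# annotator-agreement level. The dataset id and configuration names are
# assumptions to verify against the dataset card.
from datasets import load_dataset

dataset = load_dataset(
    "financial_phrasebank",
    "sentences_66agree",          # sentences where at least 66% of annotators agreed
    split="train",                # the dataset ships as a single split
    # trust_remote_code=True,     # may be required by newer `datasets` versions
)
print(dataset[0])                 # a 'sentence' plus an integer sentiment 'label'

# There is no predefined train/validation/test split, so create one yourself.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_set, test_set = splits["train"], splits["test"]
```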

Evol-Instruct

The training dataset for this fine-tuned model (WizardCoder, covered below) is initialised with Code Alpaca, a 20K instruction-following dataset.

The Evol-Instruct technique is then applied to evolve the dataset, with the evolved data being merged with the original dataset after each round.

WizardCoder

Dataset Focus and Views

The dataset used for training WizardCoder plays a crucial role in its superior performance.

The authors start with Code Alpaca, a 20K instruction-following dataset, and iteratively evolve it using the Code Evol-Instruct method.

This process generates a more complex and diverse dataset tailored specifically for code-related tasks.

The adaptation of Evol-Instruct to the code domain involves several key modifications:

  1. Refining evolutionary instructions by removing deepening, complicating input, and In-Breadth Evolving.

  2. Simplifying the form of evolutionary prompts by unifying the prompt template.

  3. Incorporating code-specific evolutionary instructions, such as code debugging and time-space complexity constraints.

These modifications demonstrate the importance of domain-specific adaptations when creating datasets for training LLMs.

By considering the unique characteristics and requirements of the code domain, the authors were able to generate a dataset that effectively enhances the performance of the Code LLM.
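
To make the adaptation concrete, here is a hedged sketch of a single Code Evol-Instruct-style evolution round. The unified template and the evolution operations are paraphrased from the description above rather than quoted from the paper, and call_llm is a hypothetical placeholder for whichever model API you use.

```python
import random

# Paraphrased, code-specific evolution operations (illustrative, not verbatim).
EVOLUTION_OPS = [
    "Add new constraints and requirements to the original problem.",
    "Provide a piece of erroneous code as a reference and ask for it to be debugged.",
    "Propose higher time or space complexity requirements.",
]

# A single, unified prompt template, in the spirit of point 2 above.
UNIFIED_TEMPLATE = (
    "Please increase the difficulty of the given programming test question.\n"
    "Use the following method:\n{method}\n\n{question}"
)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your model or API of choice here.
    raise NotImplementedError

def evolve_round(instructions: list[str]) -> list[str]:
    """Evolve each instruction once and merge the results with the originals."""
    evolved = [
        call_llm(UNIFIED_TEMPLATE.format(method=random.choice(EVOLUTION_OPS), question=q))
        for q in instructions
    ]
    return instructions + evolved
```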

Creating Datasets for LLM training

When creating similar datasets for training LLMs, I believe the following points should be considered:

  1. Start with a high-quality initial dataset relevant to the target domain.

  2. Identify domain-specific characteristics and requirements that can be incorporated into the dataset evolution process.

  3. Develop a set of evolutionary instructions and prompts that align with the domain-specific goals and constraints.

  4. Iteratively evolve the dataset using the adapted evolutionary instructions, evaluating the performance of the model after each round of evolution.

  5. Continuously refine the evolutionary process based on the observed performance and domain-specific insights gained during the training process.

By following these guidelines and adapting the Evol-Instruct method to the specific domain of interest, researchers can create high-quality datasets that enable the training of LLMs with superior performance in their respective domains.
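
Guidelines 4 and 5 amount to an outer loop around the evolution step: evolve, merge, fine-tune, evaluate, and keep going only while the benchmark improves. The sketch below reuses the evolve_round helper from the previous example, with fine_tune and evaluate as hypothetical placeholders for your own training and evaluation harness.

```python
# A hedged sketch of the iterative evolve-and-evaluate loop described above.
# `evolve_round` is defined in the previous sketch.
def fine_tune(data: list[str]):
    # Hypothetical placeholder for a fine-tuning run on the merged dataset.
    raise NotImplementedError

def evaluate(model) -> float:
    # Hypothetical placeholder for a domain benchmark (e.g. pass@1 on HumanEval).
    raise NotImplementedError

def evolve_dataset(seed_instructions: list[str], max_rounds: int = 3) -> list[str]:
    data = list(seed_instructions)
    best_score = float("-inf")
    for round_idx in range(max_rounds):
        data = evolve_round(data)      # evolve and merge with the originals
        score = evaluate(fine_tune(data))
        print(f"round {round_idx + 1}: {len(data)} examples, score {score:.3f}")
        if score <= best_score:
            break                      # stop once evolution no longer helps
        best_score = score
    return data
```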

The Persuasion Dataset

The Persuasion Dataset is structured to provide a comprehensive understanding of the effectiveness of various arguments, whether human-written or model-generated, in altering a person's stance on specific claims. Here is a structured table summarizing the dataset's components:

| Column Name | Description |
| --- | --- |
| worker_id | Identifier for the participant who annotated their stance. |
| claim | The statement or assertion presented for argument. |
| argument | The argument provided, crafted by either a human or a language model. |
| source | Indicates whether the argument was generated by a human or a specific model. |
| prompt_type | The type of prompt used to generate the argument. |
| rating_initial | Participant's initial rating of the claim before reading the argument. |
| rating_final | Participant's final rating of the claim after being exposed to the argument. |
| persuasiveness_metric | Numerical score indicating the persuasiveness of the argument (deducible from the initial and final ratings). |

Explanation of the Dataset Structure

Worker ID: This is crucial for tracking the responses from individual participants across multiple claims or arguments, ensuring the data is consistent and attributed correctly.

Claim: The focal point of the dataset; these are statements or topics on which arguments are based. Understanding the diversity and nature of claims is essential for analyzing how different types of arguments perform.

Argument: Central to the dataset, these entries show the persuasive text presented to participants. The content and quality of these arguments are key to the research being conducted.

Source: Distinguishing between human and model-generated arguments allows researchers to compare the effectiveness of natural versus artificial persuasive techniques.

Prompt Type: Knowing the prompt type helps in understanding the context or angle from which the argument was developed, which can influence its persuasiveness.

Ratings (Initial and Final): These metrics are critical as they provide a before-and-after snapshot of the participant's stance on a claim, showing the direct impact of the argument. The change between these ratings can be used as a quantitative measure of argument effectiveness.

Persuasiveness Metric: It's plausible that this metric is simply calculated from the difference between the initial and final ratings, quantifying how persuasive an argument was.

This dataset can be immensely useful for developing and refining language models aimed at persuasive writing. It also provides insights into human cognitive biases and response patterns to different forms of rhetoric, aiding in fields like marketing, political science, and psychology.
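
Below is a minimal sketch of deriving that persuasiveness measure (final rating minus initial rating) and comparing argument sources. It assumes the dataset is published on the Hugging Face Hub as Anthropic/persuasion and that the ratings are Likert responses whose leading digit carries the 1-7 score; verify the id, column names, and rating encoding against the dataset card.

```python
# A minimal sketch: compute the rating shift per argument and average it
# by source (human vs. individual models). Dataset id and rating format
# are assumptions to verify against the dataset card.
import re
from datasets import load_dataset

df = load_dataset("Anthropic/persuasion", split="train").to_pandas()

def likert_score(value) -> float:
    # Extract the leading number from entries such as "5 - Somewhat support".
    return float(re.match(r"\d+", str(value)).group())

df["rating_shift"] = df["rating_final"].map(likert_score) - df["rating_initial"].map(likert_score)

print(df.groupby("source")["rating_shift"].mean().sort_values(ascending=False))
```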

Below is some more background on this interesting dataset:

Anthropic Persuasion

The paper "Measuring the Persuasiveness of Language Models" by Anthropic researchers explores the relationship between the scale of AI language models and their ability to generate persuasive arguments.

The study compares the persuasiveness of arguments generated by various Anthropic models (Claude 1, 2, and 3) and two classes of models (compact and frontier) with human-written arguments.

Methodology

  1. The researchers curated 28 topics on complex and emerging issues where people are less likely to have hardened views, such as online content moderation, ethical guidelines for space exploration, and the appropriate use of AI-generated content. They created 56 opinionated claims, with supporting and opposing claims for each topic.

  2. For human-written arguments, three participants were randomly assigned to each claim and asked to write persuasive messages of approximately 250 words. Participants were informed that their submissions would be evaluated, with the most persuasive author receiving additional compensation.

  3. For AI-generated arguments, the researchers used four distinct prompts to generate arguments: Compelling Case, Role-playing Expert, Logical Reasoning, and Deceptive. The ratings of changed opinions were averaged across these prompts to calculate the persuasiveness of the AI-generated arguments.

  4. To assess persuasiveness, participants were shown a claim without an accompanying argument and asked to report their initial level of support on a 1-7 Likert scale. They were then shown an argument supporting the claim, written by either a human or an AI model, and asked to re-rate their stance. The persuasiveness metric was defined as the difference between the final and initial support scores.

  5. As a control condition, the researchers presented participants with Claude 2-generated arguments attempting to refute indisputable factual claims, in order to quantify opinion changes due to extraneous factors.

Findings

  1. Claude 3 Opus was found to be roughly as persuasive as humans, with no statistically significant difference.

  2. A general scaling trend was observed: as models get larger and more capable, they become more persuasive.

  3. The control condition worked as expected, with the persuasiveness score close to zero for indisputable factual claims.

Lessons Learned

The researchers discussed the limitations of their study, including:

  1. Persuasion is difficult to study in a lab setting, and the results may not transfer to the real world.

  2. Evaluating the persuasiveness of arguments is inherently subjective.

  3. The experimental design has limitations, such as studying only single-turn arguments, using human writers who may not be experts in persuasion, and not exploring human + AI collaboration.

  4. Cultural and linguistic context may limit the generalizability of the findings.

  5. The study might suffer from an anchoring effect, limiting the magnitude of the persuasiveness effect observed.

  6. Different prompting methods work differently across models, with the Deceptive strategy being the most persuasive overall.

  7. The researchers did not measure the longer-term effects of being exposed to persuasive arguments.
