Using ChatGPT to build Synthetic Datasets
When solving very specific business problems, it can often be challenging to find a large and diverse dataset for training machine learning models.
Real-world datasets can be costly and time-consuming to collect, and they may not always contain the necessary variety of data to create accurate models. In recent years, the use of synthetic datasets generated by large language models like GPT has become increasingly popular.
In this article, we will generate a synthetic dataset of complicated sentences and their rewritten simpler versions using GPT models.
Why use Synthetic Datasets?
Synthetic datasets are computer-generated datasets that can be tailored to specific needs and created quickly and inexpensively compared to collecting and labeling real-world data. This provides businesses with a time-saving and cost-effective solution for training their machine-learning models.
Synthetic datasets can also be more diverse and representative than real-world datasets, which gives businesses an advantage. Real-world datasets may be limited in the variety of data they contain, which affects the accuracy of machine-learning models. Synthetic datasets can be generated with specific properties, such as balanced class distributions, that are hard to achieve in real-world data, leading to more accurate models and better-informed decision-making.
While synthetic datasets may not be able to fully capture the complexities of real-world data, they are still a powerful tool for businesses to overcome the challenges of collecting real-world data and developing machine-learning models that meet their specific needs.
How to Generate Synthetic Data with ChatGPT
Models like GPT-4 and ChatGPT are trained on huge datasets, which gives them broad knowledge and the ability to perform a wide variety of tasks. We can use this property of LLMs to generate synthetic datasets that are specific to our needs.
While we can generate synthetic datasets from scratch, it is always better to provide the model with some examples to begin with.
Let’s discuss the EXACT pipeline we use to generate high-quality synthetic datasets.
Extracting properties from provided examples
The first step is to analyze the examples provided and understand what exactly we are seeking to achieve. This is not required, but we have found that this greatly increases the quality of generated outputs.
The goal of this step is to extract and understand the key properties of the provided examples. The synthetic dataset consists of generated input-output pairs, and these properties guide the model when it generates them.
Here’s an example of how to do this:
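The original prompt is not reproduced here, but a prompt along the following lines captures the idea (the wording and the listed properties are our own illustration, not the original):

```
Below are example pairs of complicated sentences and their simpler rewritten versions:

Complicated: <example complicated sentence>
Simple: <example simple sentence>

Complicated: <example complicated sentence>
Simple: <example simple sentence>

List the key properties these pairs have in common (for example sentence length,
vocabulary level, tone, and how the rewriting simplifies the sentence) as a short
bullet list.
```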
Generating the dataset
Once we have the properties we want our dataset to have, we can begin generating the synthetic dataset. We can also tweak the properties at this point, which gives us more control over the generated dataset.
For this, we usually set the model parameters in a specific way:
- Temperature: The temperature is usually set to a high value, 1. This is because we want the model to be as “random” as possible, while still staying within the constraints provided, so that it generates the most diverse dataset possible.
- Frequency Penalty: This value is also set to a higher number, 1. A higher frequency penalty value will encourage the model to produce more diverse and unique content by penalizing the repetition of the same words or phrases.
- Presence Penalty: Setting this parameter is a bit tricky. With a higher presence penalty, the model is penalized for generating words that have already appeared in the text. This can be good if you want your dataset to be diverse and contain lots of different words. But if you are generating something very specific, you might want the same words to appear in your dataset multiple times in different contexts. The value of this parameter depends entirely on your requirements.
You can of course tweak these parameters according to your needs.
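As a rough sketch, the settings above translate into the following call parameters for the legacy openai Python SDK (the presence penalty of 0 is just one possible choice, as discussed above):

```python
# Generation parameters discussed above; tweak them to your needs.
generation_params = {
    "temperature": 1,        # as "random" as possible within the prompt's constraints
    "frequency_penalty": 1,  # discourage repeating the same words and phrases
    "presence_penalty": 0,   # raise this if you want to push for more unique words
}
```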
Here’s the prompt to use for this step:
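The exact prompt is not reproduced here; an illustrative version, using the tkn_topic_tkn placeholder that the code below expects, might look like this (the pair count and the JSON keys are our own assumptions):

```
Generate 5 pairs of sentences about tkn_topic_tkn. In each pair, the first sentence
should be long and complicated, and the second should express the same idea in a
shorter, simpler way.

The pairs should follow these properties: <insert the properties extracted in the
previous step>

Return the pairs as a JSON list of objects with the keys "complicated" and "simple".
```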
You can change the topic according to your needs.
The prompts are adjusted to compress the information from the first step's output, and initial testing shows that the results are good enough to work with. Specific use cases, however, will require specific changes, and the prompts will have to be adapted accordingly.
Here’s an example output of the above prompt with the topic set to business administration:
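The original output is not shown here; with a prompt like the one above, the response would be a JSON list along these lines (the angle-bracket placeholders stand in for actual model output):

```
[
  {
    "complicated": "<a long, convoluted sentence about business administration>",
    "simple": "<the same idea rewritten as a short, plain sentence>"
  },
  ...
]
```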
Postprocessing and storing
Once we have the prompts ready, we can tie everything together and start generating the dataset. We usually ask GPT-3 to generate data in a specific format so we can parse and store it. In this article we are going to use JSON, as it is very common and easy to process.
For this tutorial, we will take the JSON output from GPT-3, parse it with Python, and store it in a pandas DataFrame. After that, we can save it as a CSV.
Generating a Synthetic Dataset of Rephrased Sentences
Now that we have the pipeline and the prompts laid out, we can generate the dataset. Here we present an example of generating a dataset of rephrased sentences.
Python code to call GPT APIs
Now we write a Python function that calls the GPT APIs and generates the dataset. We will call this function repeatedly with different values for the topic parameter to generate sentences with a diverse range of vocabulary.
Here’s the code:
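A minimal sketch of this function follows; the prompt wording, the text-davinci-003 model choice, the token limit, and reading OPENAI_KEY from an environment variable are illustrative assumptions rather than the original values:

```python
import os

import openai

# Authenticate our requests (legacy openai < 1.0 interface). Loading the key from
# an environment variable is an assumption; it could also come from a config file.
openai.api_key = os.environ["OPENAI_KEY"]

# Different versions of the same prompt; tkn_topic_tkn is replaced with the topic
# we want GPT to generate text about. The exact wording here is illustrative.
prompts = [
    (
        "Generate 5 pairs of sentences about tkn_topic_tkn. In each pair, the first "
        "sentence should be long and complicated and the second should express the "
        "same idea in a shorter, simpler way. Return the pairs as a JSON list of "
        'objects with the keys "complicated" and "simple".'
    ),
]


def infer_gpt(topic, prompt_version=0):
    """Call the GPT API and return the raw generated text for the given topic."""
    prompt = prompts[prompt_version].replace("tkn_topic_tkn", topic)
    response = openai.Completion.create(
        model="text-davinci-003",  # illustrative model choice
        prompt=prompt,
        max_tokens=1024,
        temperature=1,        # maximise diversity, as discussed above
        frequency_penalty=1,  # discourage repeated words and phrases
        presence_penalty=0,
    )
    return response["choices"][0]["text"]
```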
In this code, we first import the openai module, which allows us to call the GPT APIs. We then load the OPENAI_KEY and set it as the API key to authenticate our requests to OpenAI's GPT service.
The infer_gpt() function takes two parameters: topic and prompt_version, with a default value of 0 for prompt_version.
The prompts variable holds a list of different versions of the same prompt. In this case, the tkn_topic_tkn placeholder within the prompt is replaced with the value of the topic parameter, which is the topic we want GPT to generate text about.
Then, we call the openai.Completion.create() method to generate the sentence pairs.
Calling the API to generate the dataset
Once we have the code to generate the sentences, we can write a loop that will generate them in the context of a specific topic.
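A sketch of that loop might look like this (the topic list, the number of calls per topic, and the JSON keys are illustrative assumptions):

```python
import json

import pandas as pd

# Illustrative topic list and call count; adjust both to your needs.
topics = ["business administration", "healthcare", "software engineering", "education"]
calls_per_topic = 10

rows = []
for topic in topics:
    for _ in range(calls_per_topic):
        raw_output = infer_gpt(topic)
        try:
            pairs = json.loads(raw_output)
        except json.JSONDecodeError:
            continue  # skip responses that are not valid JSON
        for pair in pairs:
            rows.append({"topic": topic, **pair})

df_gpt = pd.DataFrame(rows)
```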
In this bit of code, we loop through our list of topics and call infer_gpt() the requested number of times for each topic in the list.
Once done, we will have all the data in the df_gpt variable. We can dump that data into a CSV:
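A one-line sketch, assuming the df_gpt DataFrame from above (the file name is arbitrary):

```python
df_gpt.to_csv("synthetic_rephrased_sentences.csv", index=False)
```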
You can access the data we generated here.
Exploring the Dataset
As a dataset of rewritten sentences, this works really well. For rewriting, we wanted a wide vocabulary and many different scenarios in which sentences could be rewritten.
We achieved a wide range of vocabulary using the topic parameter, which gave us some control over what kinds of words were generated.
But when it comes to different types of scenarios, this is definitely not a good dataset. All the pairs consist of a single rewritten sentence, whereas in real life there might be multiple sentences or even whole paragraphs that need rewriting. The dataset also lacks variety in the topic of the sentences, most likely because of the example passed in the prompt: the generated sentences are all too similar to the given example.
Synthetic Data vs Real Data
Synthetic data can indeed help solve a lot of problems in the industry, but it is important to note that synthetic data is always limited by how we are generating it and what underlying model is being used.
In the above example, the synthetic data is pretty good but is not all that is required to train a model to rephrase sentences. There are many ways a sentence can be simplified: semantically, syntactically, and even by using simple conjunctions. The generated dataset does not contain all these variations. True, these can be generated too, but that requires more work and a new prompt.
In real data, these variations will be present naturally. Synthetic data alone should not be used to train highly accurate systems; it should be used to supplement the data that is already present. It can also be used to introduce new variations and types into the existing data.
Want to build synthetic datasets?
Looking to build synthetic datasets and train AI models on them? We have worked with synthetic datasets in different fields. We have internal pipelines to ensure the quality of the generated data and experience in using them properly to build and deploy AI systems in production.
Book a call, let’s help you build your specific dataset!