Comprehensive Guide to Chain-of-Thought Prompting
Scaling up large language models (LLMs) has produced strong results on tasks such as sentiment analysis and machine translation, even without any examples. However, these models often fail at complex multi-step problems such as arithmetic and commonsense reasoning. To address this, an LLM can either be fine-tuned for a particular task or guided with few-shot prompting. Both of these methods have limitations: fine-tuning is costly because high-quality reasoning data must be created, while few-shot prompting alone is often not effective enough for such tasks.
Chain-of-Thought (CoT) prompting can address both of these problems. In this article, we will explore CoT prompting and how implementing it can benefit your business.
What is Prompt Engineering?
Prompt engineering is the practice of writing well-structured, carefully crafted prompts that a generative AI model can interpret reliably. A prompt tells the LLM what task to perform and what kind of output to generate, and it can contain instructions, context, input data, and output indicators. Using prompt engineering, we can apply LLMs to tasks ranging from simple question answering to complex creative text generation. It relies on an emergent property called in-context learning, which allows LLMs to learn from the prompt itself. Prompt engineering improves the performance of LLMs on the task at hand and includes zero-shot, few-shot, active, and CoT prompting, as discussed below.
We also have a blog on advanced prompt engineering.
Zero-shot Prompting
In zero-shot prompting, we give the LLM a prompt that describes the task but does not include any examples. The LLM is then asked to generate a response. This improves the flexibility and generalization of LLMs and lets us apply them to many tasks without collecting training data for each one. For example, ChatGPT can write a poem about prompt engineering without being shown any examples of how to write a poem. However, zero-shot prompting is limited for complex tasks.
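As a concrete illustration, here is a minimal sketch of a zero-shot prompt. The sentiment-classification wording and the `call_llm` helper are illustrative assumptions, not a specific provider's API.

```python
# A zero-shot prompt: the task is described, but no examples are given.
# `call_llm` is an assumed placeholder for whichever LLM client you use.

def build_zero_shot_prompt(review: str) -> str:
    return (
        "Classify the sentiment of the following review as Positive, "
        "Negative, or Neutral.\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

prompt = build_zero_shot_prompt("The battery died after two days. Very disappointed.")
print(prompt)
# answer = call_llm(prompt)   # expected to return something like "Negative"
```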
Few-shot Prompting
Few-shot prompting provides demonstrations that steer the model toward better performance. It is a technique for giving the LLM a few examples of the desired output in addition to the prompt. The examples help the model better understand the task and generate more accurate and informative responses. We should provide diverse examples rather than several near-identical ones, so the model learns as much as possible about the task. Standard few-shot prompting works well for many tasks but is unreliable for complex reasoning tasks, which is why more advanced techniques such as chain-of-thought prompting, active prompting, and fine-tuning are needed.
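For contrast with the zero-shot sketch above, here is a minimal few-shot prompt for the same kind of task. The example texts and labels are made up for illustration.

```python
# A few-shot prompt: a handful of diverse input-output examples precede the
# new input. The examples below are illustrative, not from any benchmark.

EXAMPLES = [
    ("The plot was predictable and the acting flat.", "Negative"),
    ("Great value for the price, works exactly as described.", "Positive"),
    ("It arrived on time. It is a phone case.", "Neutral"),
]

def build_few_shot_prompt(text: str) -> str:
    demos = "\n\n".join(f"Text: {t}\nSentiment: {label}" for t, label in EXAMPLES)
    return f"{demos}\n\nText: {text}\nSentiment:"

print(build_few_shot_prompt("The screen cracked within a week."))
```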
Active Prompting
Active prompting improves the performance of LLMs on complex tasks by iteratively providing them with feedback on their responses, which helps them learn from their mistakes and generate more accurate and informative answers. The LLM is given a prompt and a few examples of the desired output and generates a response. A human evaluator then assesses the response for accuracy and informativeness and provides feedback, which the LLM uses to improve its subsequent responses. This process repeats until the LLM's responses are accurate and informative enough to satisfy the evaluator.
Active prompting is important for CoT prompting as it identifies important questions for annotation, minimizes human annotation efforts, and improves the accuracy and informativeness of CoT prompts. The following figure shows active prompting with CoT to improve performance. It is a four-stage process:
- Uncertainty estimation: query the LLM multiple times for each question to estimate how uncertain its answers are.
- Selection: rank the questions by uncertainty and select the most uncertain ones for annotation.
- Annotation: have human evaluators annotate the selected questions with detailed feedback.
- Inference: use the LLM, together with the feedback from the annotation step, to answer new questions with improved quality.
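Below is a minimal sketch of the first two stages (uncertainty estimation and selection). The `sample_answer(question)` helper and the disagreement score are illustrative assumptions, not the exact implementation of the published method.

```python
# A minimal sketch of the uncertainty-estimation and selection stages of
# active prompting. `sample_answer(question)` is an assumed placeholder that
# returns one sampled answer string from your LLM (temperature > 0).

def disagreement(answers):
    """Uncertainty score: the fraction of distinct answers among the samples."""
    return len(set(answers)) / len(answers)

def select_for_annotation(questions, sample_answer, k=10, n_select=5):
    scored = []
    for question in questions:
        answers = [sample_answer(question) for _ in range(k)]
        scored.append((disagreement(answers), question))
    scored.sort(reverse=True)                   # most uncertain questions first
    return [q for _, q in scored[:n_select]]    # hand these to human annotators
```

The selected questions would then receive human-written reasoning chains and serve as few-shot CoT exemplars when answering new questions.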
What is Chain-of-Thought Prompting?
Chain-of-Thought prompting is a prompt engineering technique that prompts LLMs to output a sequence of intermediate steps leading to the desired answer, which improves their reasoning abilities. It is beneficial because it allows the model to focus on solving one step at a time rather than considering the entire problem at once, which is especially helpful for complex problems that would be difficult or impossible to solve in a single step. It also provides an interpretable window into the model's behavior: we can see how the model arrived at its answer by following the sequence of steps it took.
CoT prompting can be used with sufficiently large LLMs (on the order of 100B parameters) for several reasoning tasks, including math word problems, commonsense reasoning, and symbolic manipulation. For example, using CoT prompting with the PaLM model instead of standard few-shot prompting improved performance on the GSM8K benchmark from 17.9% to 58.1%. CoT behavior can be readily elicited from sufficiently large language models without any special training or fine-tuning, which makes CoT prompting a scalable and accessible technique.
Few-Shot CoT
In few-shot CoT, the LLM is prompted with a few worked examples of similar problems, each consisting of a question, a chain of intermediate reasoning steps, and the final answer, followed by the new question. The examples are presented in a way that encourages the LLM to reason about the problem and produce its own chain of thought that leads to the answer.
Few-shot CoT improves the reasoning abilities of LLMs more than the few-shot baseline because it shows the model worked examples of similar problems. It is more complex to implement than a plain few-shot baseline because example reasoning chains must be written, but the benefits usually outweigh the additional effort.
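To make this concrete, here is a minimal sketch of a few-shot CoT prompt. The first exemplar is the well-known tennis-ball problem popularized by the original CoT paper; the second question and the `call_llm` helper are illustrative assumptions.

```python
# A few-shot CoT prompt: each exemplar shows a question, the intermediate
# reasoning, and the final answer, so the model imitates the reasoning pattern.

FEW_SHOT_COT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

prompt = FEW_SHOT_COT.format(
    question="A baker had 23 muffins, sold 17, and then baked 12 more. How many muffins does she have now?"
)
# answer = call_llm(prompt)  # `call_llm` is an assumed placeholder; the model
# should reason step by step and end with "The answer is 18."
```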
Zero-Shot CoT
Zero-shot CoT involves appending "Let's think step by step" to the original prompt. It extracts the reasoning and the answer using two prompts:
- Reasoning extraction: In this step, the language model thinks about the question and comes up with a chain of reasoning that leads to the answer. For this, we give the language model a prompt that includes the question and a trigger sentence "Let's think step by step." The language model will then generate a sentence that explains how it arrived at the answer.
- Answer extraction: In the second step, we extract the final answer from the language model's response. We concatenate the prompt, the generated sentence, and a trigger sentence, "The answer is". It tells the language model to give us the answer. The language model will then generate a sentence that contains the answer to the question.
In contrast to this, the zero-shot baseline uses prompts like "The answer is" for answer extraction. Few-shot prompting, whether standard or CoT, avoids the need for such answer-extraction prompts by designing example answers to end in the correct formats.
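The two prompts can be chained in a few lines of code. The sketch below assumes a `call_llm(prompt)` placeholder for your model client; the trigger sentences are the ones described above.

```python
# Minimal sketch of the two-stage zero-shot CoT pipeline described above.
# `call_llm(prompt)` is an assumed placeholder for your LLM client.

def zero_shot_cot(question: str, call_llm) -> str:
    # Stage 1: reasoning extraction, using the "Let's think step by step" trigger.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_llm(reasoning_prompt)

    # Stage 2: answer extraction, concatenating the prompt, the generated
    # reasoning, and the "The answer is" trigger.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nThe answer is"
    return call_llm(answer_prompt)
```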
Comparing zero-shot CoT with two other methods for evaluating the zero-shot reasoning abilities of LLMs, researchers found that zero-shot CoT outperforms the other methods on various reasoning tasks. If you are looking for a smaller model trained with CoT-style data, consider the Flan-T5 model, which can be used for zero-shot NLP tasks including text summarization, natural language inference, translation, and commonsense reasoning.
When Does CoT Emerge?
CoT reasoning is an emergent ability of LLMs that appears to arise only when models are scaled beyond roughly 100 billion parameters; for smaller LLMs it does not improve performance. There are two likely reasons. First, smaller LLMs struggle to generate long chains of thought that are both fluent and logical, which leads to lower performance than standard prompting. Second, CoT reasoning is more effective on more complicated problems: it requires the LLM to identify the key steps involved in solving a problem and then generate a chain of thought that leads to the solution, which smaller LLMs may not be able to do as effectively as larger ones.
Another possible reason for the emergence of CoT reasoning in large LLMs is their pre-training data. Larger LLMs are typically trained on massive datasets that include step-by-step reasoning, which could help them develop the ability to reason in a chain-of-thought fashion. Instruction-following does not seem to be necessary for CoT capabilities, since zero-shot and few-shot CoT reasoning has been demonstrated with LLMs that were not fine-tuned to follow instructions, although instruction tuning may improve the quality of CoT reasoning. Ultimately, more research is needed to determine the exact cause of the emergence of CoT reasoning in large LLMs.
How To Perform CoT Prompting?
To perform chain-of-thought prompting, you simply append "Let's think step by step" to the end of your prompt. This nudges the model to break the problem into smaller steps and work through them in order. Here's an example of what happens when you don't use and when you do use chain-of-thought prompting:
Here you can see how using chain of thought makes the LLM return a more detailed, better-reasoned, and correct output, whereas the prompt without step-by-step thinking immediately results in a wrong answer.
If you have a rather strict problem that you know can only be solved with a specific set of reasoning patterns, that is where few-shot CoT comes in. You can provide a few examples of the reasoning steps required for your class of problems, and the LLM will attempt to solve the given problem using similar steps. You can also use this technique to have problems solved in a specific style for your users. For example, if students are going to be using your app, you might want few-shot CoT exemplars that solve problems in a fun, simple, and easy-to-understand way.
These few-shot examples should showcase the intermediate steps as well as the final solution. Once you have developed the chain-of-thought prompts and examples, incorporate them into your prompts. Finally, test the model and iterate on the prompts and examples until the model's performance is satisfactory.
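One simple way to test and iterate is to score each candidate prompt on a small held-out set of questions with known answers. The sketch below assumes a `call_llm` placeholder and a numeric answer format ending in "The answer is N"; both are illustrative assumptions.

```python
import re

# Score a candidate few-shot CoT prompt template on a small evaluation set so
# that different exemplar choices can be compared. `call_llm` is an assumed
# placeholder; `eval_set` is a list of (question, gold_answer_string) pairs.

def extract_answer(completion: str) -> str:
    match = re.search(r"answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", completion, re.IGNORECASE)
    return match.group(1).replace(",", "") if match else ""

def accuracy(prompt_template: str, eval_set, call_llm) -> float:
    correct = 0
    for question, gold in eval_set:
        completion = call_llm(prompt_template.format(question=question))
        correct += extract_answer(completion) == gold
    return correct / len(eval_set)

# Keep whichever prompt variant scores best, e.g.:
# best = max(candidate_templates, key=lambda t: accuracy(t, eval_set, call_llm))
```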
Key Aspects of CoT Prompting
In this section, we will explore crucial dimensions of CoT prompting that affect its performance and reliability in large language models. We will delve into how sensitivity, self-consistency, robustness, and coherence play pivotal roles in shaping the effectiveness of the CoT prompting technique.
Self-consistency
Self-consistency is a technique for improving the performance of language models on tasks that require multi-step reasoning. In the context of chain-of-thought prompting, self-consistency samples multiple, diverse chains of thought for the same problem and then selects the most consistent final answer from among them, typically by majority vote, rather than relying on a single greedily decoded chain.
Self-consistency significantly boosts the performance of CoT prompting on many popular arithmetic and commonsense reasoning benchmarks: it improved CoT prompting by 17.9 percentage points on GSM8K, 11.0 on SVAMP, and 12.2 on AQuA. It is an entirely unsupervised technique that works off the shelf with pre-trained language models; it requires no additional human annotation and avoids any extra training, auxiliary models, or fine-tuning. It is also robust to sampling strategies and parameters: when varying T in temperature sampling, k in top-k sampling, and p in nucleus sampling with PaLM-540B, self-consistency consistently improved performance.
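In code, the core of self-consistency is only a few lines: sample several chains of thought at a non-zero temperature and keep the majority answer. The `sample_cot_answer` helper below is an assumed placeholder that runs one CoT completion and extracts its final answer.

```python
from collections import Counter

# Minimal sketch of self-consistency: sample several diverse CoT completions
# for the same question and return the answer that appears most often.

def self_consistent_answer(question: str, sample_cot_answer, n_samples: int = 10) -> str:
    answers = [sample_cot_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority-vote answer
```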
Robustness
The researchers conducted experiments with three different sets of chain-of-thought annotations, each written by a different annotator, and found that CoT prompting consistently outperformed the standard baseline regardless of the annotator. This suggests that CoT prompting does not depend on a particular linguistic style. They also ran experiments with exemplars randomly sampled from the GSM8K training set, an independent source, and found that CoT prompting with these exemplars performed comparably to CoT prompting with manually written exemplars, suggesting it does not depend on the specific exemplars used.
Further experiments varied the number of exemplars and showed that CoT prompting remains robust as this number changes, so it does not require a large number of exemplars to be effective. Experiments with a variety of language models, including LaMDA 137B, showed that CoT prompting is effective across models rather than being tied to one in particular. Overall, these results suggest that CoT prompting is a robust technique for improving the performance of language models on a variety of tasks: it does not depend on a particular linguistic style, annotator, set of exemplars, or language model.
Sensitivity
Sensitivity in CoT prompting refers to the extent to which the performance of the model is affected by the design of the prompts. If the prompts are not well-designed, then the model's performance may deteriorate. The prompts should be clear, concise, and easy for the model to understand. Avoid using jargon or technical terms that the model may not be familiar with. The prompts should be matched to the specific task that the model is trying to solve. If the prompts are not matched to the task, then the model may not be able to generate the correct answer. The more complex the task, the more sensitive the model may be to the design of the prompts.
The performance of few-shot CoT deteriorated when the prompt example question types and task question types were unmatched. This suggests that few-shot CoT is highly sensitive to the design of the prompts and that the prompts need to be carefully matched to the specific task to achieve good performance.
Coherence
Coherence refers to the extent to which the steps of a CoT rationale appear in the correct order: later steps should not be preconditions for earlier steps, and earlier steps should not be based on later steps. For example, a rationale in which "32 + 42 = 74" appears before the numbers "32" and "42" have been introduced would not be coherent, because that equation is a later step that depends on the earlier steps introducing those numbers.
The researchers designed a set of ablation settings to examine the impact of coherence on different components of a CoT-like rationale. Ablation settings are a way of testing the importance of different parts of a system by removing them and observing the impact on the system's performance. It was found that coherence was important for all components of a CoT-like rationale. When coherence was removed, the performance of the system deteriorated.
The researchers also found that the coherence of language templates is particularly important for the performance of CoT prompting. Language templates are the phrases that are used to connect the different steps of a CoT rationale. If the language templates are not coherent, then the model may not be able to understand the rationale and generate the correct answer.
Types of Chain-of-Thought Prompting
Within the realm of chain-of-thought (CoT) prompting, several notable variations have emerged as impactful strategies: multimodal CoT, least-to-most prompting, and Auto-CoT. Let us explore these techniques in detail.
Multi-modal CoT
Traditional CoT focuses on the language modality, which means that it only uses text to provide the model with a context for reasoning. Multimodal CoT incorporates text and vision into a two-stage framework. The first step involves rationale generation based on multimodal information. This means that the model is provided with both text and images, and it is then asked to generate a rationale that explains how the text and images are related. The second phase of the framework is answer inference. This is where the model uses the informative rationale that it generated in the first step to infer the correct answer to the question.
Multimodal CoT with under 1B parameters outperforms GPT-3.5 by 16 percentage points (75.17% vs. 91.68% accuracy) and surpasses human performance on the ScienceQA benchmark. Among the eight question classes, the model improved performance from 67.43% to 88.80% on questions with paired images. Methods such as UnifiedQA and GPT-3.5 use image captions to understand what an image shows, but using image features directly proved more effective. Future studies could improve CoT reasoning by using better image features, adding commonsense knowledge, and filtering out irrelevant information.
Least-to-Most Prompting
Chain-of-thought prompting is a powerful technique for natural language reasoning, but it can struggle with problems that are harder than the examples shown in the prompt. To address this challenge, researchers proposed a prompting strategy called least-to-most prompting.
Least-to-most prompting works by breaking down a complex problem into a series of simpler subproblems, and then solving them in sequence. Each subproblem is facilitated by the answers to the previous subproblems. For example, to solve a math word problem, we might first query the language model to decompose the problem into subproblems, such as "What is the cost of the first item?" and "What is the total cost?" We would then query the language model to sequentially solve the subproblems, using the answers to the previous subproblems to inform our queries.
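A minimal sketch of this two-stage procedure is shown below, assuming a `call_llm(prompt)` placeholder; the decomposition wording is an illustrative choice rather than the exact prompts used in the paper.

```python
# Sketch of least-to-most prompting: (1) ask the model to decompose the problem
# into simpler subquestions, (2) answer the subquestions in order, feeding each
# answer back into the context for the next one. `call_llm` is an assumed placeholder.

def least_to_most(problem: str, call_llm) -> str:
    # Stage 1: problem decomposition.
    decomposition = call_llm(
        "Break the following problem into a numbered list of simpler "
        f"subquestions, one per line:\n{problem}"
    )
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: sequential solving; earlier answers become context for later ones.
    context = f"Problem: {problem}\n"
    answer = ""
    for sub in subquestions:
        answer = call_llm(f"{context}\nQuestion: {sub}\nAnswer:")
        context += f"\nQuestion: {sub}\nAnswer: {answer}"
    return answer   # the last subquestion resolves the original problem
```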
Least-to-most prompting generalizes to more difficult problems on symbolic manipulation, compositional generalization, and math reasoning tasks. GPT-3 code-davinci-002 with least-to-most prompting can solve SCAN with 99% accuracy using 14 exemplars, while chain-of-thought prompting only gets 16% accuracy. The table below shows the accuracies of different prompting methods on the subset of GSM8K and DROP benchmarks containing only numerical problems. The base language model is code-davinci-002.
The table below shows the accuracies of different prompting methods on the last-letter-concatenation task.
Auto-CoT
Auto-CoT is a way to automatically create demonstrations containing questions and reasoning chains. It uses large language models to generate the reasoning chain for each demonstration with the prompt "Let's think step by step." Auto-CoT has two main steps. First, it partitions the questions in a given dataset into a few clusters. Then, it selects a representative question from each cluster and uses Zero-Shot-CoT with simple heuristics to generate its reasoning chain. The diversity of the demonstration questions matters because it reduces the number of mistakes that Zero-Shot-CoT makes in the reasoning chains: by clustering the questions, Auto-CoT ensures that each demonstration represents a different type of question.
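The sketch below illustrates this demo-construction loop with off-the-shelf scikit-learn clustering. TF-IDF features stand in for the sentence embeddings used in the original method, the simple selection heuristics are omitted, and `call_llm` is an assumed placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Simplified sketch of Auto-CoT demo construction: cluster the questions, take
# the most central question of each cluster as its representative, and generate
# its reasoning chain with zero-shot CoT ("Let's think step by step.").

def build_auto_cot_demos(questions, call_llm, n_clusters=4):
    X = TfidfVectorizer().fit_transform(questions)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    distances = km.transform(X)          # distance of every question to every centroid
    demos = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        rep = int(members[np.argmin(distances[members, c])])   # most central question in cluster c
        rationale = call_llm(f"Q: {questions[rep]}\nA: Let's think step by step.")
        demos.append(f"Q: {questions[rep]}\nA: Let's think step by step. {rationale}")
    return "\n\n".join(demos)   # prepend this block to new questions as few-shot CoT demos
```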
Auto-CoT was tested on 10 reasoning tasks, including arithmetic reasoning (MultiArith, GSM8K, AQUA-RAT, SVAMP), commonsense reasoning (CSQA, StrategyQA), and symbolic reasoning (Last Letter Concatenation, Coin Flip). Auto-CoT consistently matched or exceeded the performance of Manual-CoT in GPT-3.
Here is a comparison of Auto-CoT with four baseline methods: Zero-Shot, Zero-Shot-CoT, Few-Shot, and Manual-CoT.
Applications of CoT
CoT has applications in various domains, including arithmetic, commonsense, and symbolic reasoning, natural language inference, and question answering. CoT prompts give LLMs the capability to address complex problems across these areas.
Arithmetic Reasoning
Chain-of-thought (CoT) prompting, when used with a 540B-parameter language model, performs comparably to task-specific fine-tuned models on various tasks, including arithmetic reasoning. Solving math word problems is a challenging task for language models. To evaluate LLMs on their ability to solve math problems, two benchmarks, MultiArith and GSM8K, are used. Standard prompting shows relatively flat scaling curves on these benchmarks, meaning that increasing model size does not substantially improve performance. With CoT prompting, however, increasing model scale significantly improves performance, especially at large model sizes.
PaLM, a 540B parameter language model, combined with CoT prompting, achieves a state-of-the-art performance of 58% on the GSM8K benchmark. Self-consistency techniques further improve CoT prompting performance, reaching 74% accuracy on GSM8K. CoT prompting results in a state of the art in math word problem-solving, surpassing fine-tuned GPT-3 baselines.
Commonsense Reasoning
Chain-of-thought prompting can also be used for commonsense reasoning tasks, which require reasoning about physical and human interactions based on general background knowledge. Commonsense reasoning remains challenging for current natural language understanding systems. CoT prompting is evaluated on commonsense reasoning benchmarks such as CommonsenseQA, StrategyQA, date understanding, and sports understanding. Performance on these tasks generally improves with model size, and CoT prompting adds modest gains on top of that baseline, with the largest improvement on sports understanding.
PaLM 540B with CoT surpassed the prior state of the art on StrategyQA (75.6% vs. 69.4%) and outperformed an unaided sports enthusiast on sports understanding (95.4% vs. 84%). However, only minimal improvement was seen on CommonsenseQA (CSQA).
Symbolic reasoning
Chain-of-thought prompting enables language models to perform symbolic reasoning tasks that are difficult with standard prompting. It also supports length generalization, allowing models to handle inference-time inputs longer than those seen in the few-shot exemplars. Two toy tasks were used to evaluate CoT prompting. The first was last-letter concatenation, where the model concatenates the last letters of the words in a name. The second was coin flip, where the model determines whether a coin is still heads up after a series of people either flip it or leave it alone.
In-domain and out-of-domain test sets were used to evaluate the performance of PaLM 540B with chain-of-thought prompting (CoT) and standard prompting on these two tasks. For in-domain evaluations, the examples had the same number of steps as the training/few-shot exemplars. For out-of-domain evaluations, the examples had more steps than those in the exemplars.
PaLM 540B with CoT achieved almost 100% solve rates on the in-domain evaluations, whereas standard prompting failed on both tasks in both in-domain and out-of-domain settings. On the out-of-domain evaluations, CoT prompting still improved performance substantially, although solve rates were lower than in-domain.
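For reference, here is what an exemplar for the last-letter-concatenation task can look like; the specific names and wording are illustrative, not the exact exemplars used in the research.

```python
# Illustrative few-shot CoT exemplar for the last-letter-concatenation toy task.
# An in-domain question has the same number of words as the exemplar (two);
# an out-of-domain question has more words than any exemplar.

LAST_LETTER_COT = """\
Q: Take the last letters of the words in "Elon Musk" and concatenate them.
A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". Concatenating them gives "nk". The answer is nk.

Q: Take the last letters of the words in "{name}" and concatenate them.
A:"""

in_domain = LAST_LETTER_COT.format(name="Ada Lovelace")                   # two words, like the exemplar
out_of_domain = LAST_LETTER_COT.format(name="John Ronald Reuel Tolkien")  # more words than the exemplar
```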
Question Answering
CoT prompting improves question answering (QA) by decomposing complex questions or prompts into a sequence of simpler, logical steps. This approach helps the language model understand the structure of the question and the relationships between its components. Each step focuses on a specific aspect of the question, helping the model identify relevant information more effectively. CoT encourages the model to perform multi-hop reasoning, where it iteratively gathers and combines information from different sources or documents, enabling it to connect separate pieces of knowledge and arrive at an accurate answer. By explicitly specifying reasoning steps, CoT prompts can help prevent common errors or biases that language models might introduce when answering complex questions. Additionally, CoT prompts allow users to understand how the model arrived at a particular response.
CoT vs. Other Methods
In this section, we delve into a detailed comparison of CoT prompting with other methods, specifically Standard and Tree of Thought Prompting. Evaluating their strengths and limitations offers valuable insights into selecting the most suitable approach for your business applications.
CoT vs. Standard Prompting
Standard prompting uses input-output pairs as examples, formatted as questions and answers, and the model predicts answers based on these pairs. It is limited in handling multi-step reasoning tasks but is suitable for straightforward ones, such as single-turn questions, and it demands fewer computational resources. It commonly uses single-shot prompts and tends to require more data or fine-tuning for complex tasks. Standard prompting also may not show significant performance improvements as models scale.
CoT prompting, by contrast, generates intermediate reasoning steps before providing a final answer. It excels at complex reasoning, enabling models to think step by step, and it is versatile and applicable to a wide range of tasks requiring intricate reasoning. It relies on carefully constructed sequences of prompts and uses data efficiently for multi-step reasoning. Its gains grow with model scale, so it tends to require more computational power, and it excels on complex reasoning benchmarks and tasks that demand multi-step problem-solving.
Comparison on MAWPS Benchmark:
Comparison on Length Generalization Task:
You can choose standard prompting for straightforward tasks, and CoT prompting for applications requiring deep, multi-step reasoning and interpretability. An open-source repository of data and tools related to CoT reasoning is available on GitHub; it includes datasets on various tasks such as math problems and commonsense reasoning, as well as a community forum for discussions.
CoT vs. Tree of Thought Prompting
CoT follows a linear approach where each new word or idea is linked directly to the one before it, forming a chain. It represents a sequential thought organization. Tree of Thoughts (ToT), on the other hand, adopts a hierarchical approach. Ideas are organized into a tree-like structure, with each idea branching off into multiple related ideas. It represents a more complex and branching thought organization.
Models prompted with CoT, such as GPT-3, are generally good at generating coherent and contextually relevant text over short spans. ToT-based systems are often better at maintaining coherence over longer texts and can keep track of multiple related ideas at once. CoT is simpler in structure and computationally less intensive than ToT because of the latter's hierarchical nature. ToT also introduces the concept of a "ToT Controller" trained through reinforcement learning (RL); this controller can potentially learn from new data or self-play, allowing the ToT system to evolve and acquire new knowledge even with a fixed language model.
CoT-SC (self-consistency with chains of thought) relies on a simple prompting technique and does not involve explicit search. ToT, in contrast, employs search algorithms such as breadth-first search (BFS) and depth-first search (DFS) over its tree structure to explore thoughts systematically, which is why it can significantly outperform these methods on problems that benefit from such exploration.
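As a rough illustration of the difference, the sketch below runs a breadth-first search over candidate thoughts. `propose_thoughts(state, k)` and `score_state(state)` are assumed placeholders: the first would ask the LLM for k candidate next steps, the second would rate how promising a partial solution is; neither is part of a specific library.

```python
# Minimal sketch of a breadth-first Tree-of-Thoughts search, in contrast to the
# single linear chain produced by CoT. At each level the search expands every
# surviving state with several candidate thoughts and keeps only the best few.

def tree_of_thoughts_bfs(problem, propose_thoughts, score_state,
                         depth=3, breadth=5, keep=2):
    frontier = [problem]                 # a state is the problem plus the thoughts so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(state, breadth):
                candidates.append(state + "\n" + thought)
        frontier = sorted(candidates, key=score_state, reverse=True)[:keep]
    return frontier[0]                   # best-scoring chain of thoughts found
```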
You can choose CoT for simpler, shorter texts, and ToT can be more appropriate for complex, longer texts and problem-solving tasks.
Want to write high-quality production-grade prompts for your LLMs?
If you are looking to leverage prompt engineering for your business needs, we can help. We are a team of AI engineers with extensive experience in prompt engineering techniques like Chain-of-Thought, ReAct, and more. Contact us today and let us apply CoT techniques to elevate your business.