A Comprehensive Comparison Of Open Source LLMs
In the last year, we have seen remarkable growth in the capability of open-source LLMs, driven by several factors: the increasing availability of data, the development of new training techniques, and the growing demand for AI solutions. Their transparency, accessibility, and customizability make them a compelling alternative to closed-source LLMs such as GPT-4.
One of the challenges with open-source LLMs is the lack of agreed-upon evaluation criteria, which makes it difficult to compare models and choose one for a particular task. However, several benchmarks, such as MMLU and ARC, have emerged to evaluate the performance of open-source LLMs across a range of tasks.
In this article, we will analyze different open-source LLMs to help you understand and choose a model for your needs.
Open Source Vs Closed Source LLMs
With open-source LLMs (Large Language Models), individuals can build custom models tailored to specific tasks and domains. These models lower the barrier to entry in AI: organizations and researchers can train their own LLMs and even deploy them on personal machines.
Businesses can use open-source LLMs to build models that reside securely on internal servers and improve over time through continuous refinement and specialization. This gives businesses control over their data and strengthens their internal AI capabilities.
The public availability of open-source LLMs, and the community involvement around them, promote experimentation and innovation. Platforms like Hugging Face provide pre-trained models and NLP tooling that make it easy to evaluate, build on, and compare state-of-the-art models.
However, open-source LLMs are still not on par with closed-source LLMs like GPT-4, largely because closed-source models are trained on more data and with more computational resources. Despite these limitations, open-source models are steadily closing the performance gap.
Exploring The Architecture Of LLMs
LLMs use a transformer architecture: a stack of neural network layers built around self-attention mechanisms, which allow the model to weigh specific parts of the input sequence when making predictions. The choice of architecture has a significant impact on a model's performance and capabilities, so understanding it is essential for using these models effectively.
Encoder Only
Figure from A BetterTransformer for Fast Transformer Inference
Encoder models use only the encoder of a Transformer. Their attention layers can access all words in the input sentence, using bidirectional context to make predictions. They are pre-trained with masked language modeling (MLM), which masks a portion of the input tokens and asks the model to predict them, helping the model understand the context of words in a sentence. The resulting text representations can then be fine-tuned for downstream tasks like classification, question answering, and named entity recognition. Examples of encoder-only models include BERT and ELECTRA.
Encoder-only models can learn long-range dependencies between words in a sentence, which helps them understand sentence-level context and make accurate predictions. They can also represent sentence meanings in a vector space, which aids sentence comparison. Because they are efficient to train and deploy, they are well suited to sentence classification, named entity recognition, and extractive question answering.
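To make this concrete, here is a minimal sketch of masked language modeling with an encoder-only model, using the Hugging Face transformers library; the bert-base-uncased checkpoint is chosen purely for illustration.

```python
# Minimal sketch: masked language modeling with an encoder-only model (BERT).
# Requires: pip install transformers torch
from transformers import pipeline

# The "fill-mask" pipeline uses bidirectional context to predict the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask("Open-source LLMs are a [MASK] alternative to proprietary models.")
for p in predictions:
    print(f"{p['token_str']:>12}  score={p['score']:.3f}")
```

The same encoder representations could instead be fed to a classification head for sentence classification or named entity recognition.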
Encoder-Decoder
Figure from Dive Into Deep Learning
Encoder-decoder models are bidirectional models pre-trained on corrupted spans of text. The encoder generates a hidden representation that summarizes the input sequence, which is then passed to the decoder to generate the output. Having both an encoder and a decoder makes these models effective at mapping complex inputs to structured outputs. Span corruption helps the model learn the context of words in a sentence, making it effective for tasks such as question answering, summarization, and translation. Examples of encoder-decoder models include BART and T5.
Encoder-decoder models are mainly used for sequence-to-sequence prediction, such as text summarization, question answering, and machine translation. They are well suited to generative tasks but are also computationally expensive, because they require both an encoder and a decoder. Their popularity is increasing as computing costs fall.
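As a quick illustration of the sequence-to-sequence pattern, the sketch below runs a summarization pipeline with an encoder-decoder model; the facebook/bart-large-cnn checkpoint is an assumed but commonly used choice.

```python
# Minimal sketch: summarization with an encoder-decoder model (BART fine-tuned for news summarization).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Open-source large language models have improved rapidly over the past year. "
    "Benchmarks such as MMLU and ARC are now commonly used to compare them, and "
    "projects like LLaMA, Falcon, and MPT have narrowed the gap with closed-source models."
)

# The encoder reads the whole article; the decoder then generates the summary token by token.
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```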
Decoder Only
Figure from A Transformer-based Generative Model for De Novo Molecular Design
Decoder-only models are auto-regressive models that use only the decoder of a transformer architecture. With no encoder, they implicitly encode information in the hidden state and update it as output is generated. They are pre-trained to predict the next word in a sentence, which makes them the most common choice for text generation. When fine-tuned, they can handle downstream tasks by placing a classifier over the hidden representation of the last token. Examples of decoder-only models include the GPT models, Transformer-XL, and LLaMA.
Decoder-only models generate one token at a time, conditioning on everything already generated to produce the next word. They are used for text generation tasks, including text completion and question answering, and can power machine translation or chatbots. They can also be improved over time with feedback, for example through fine-tuning on human preferences.
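The sketch below shows the autoregressive generation loop in practice, using the small gpt2 checkpoint simply because it runs almost anywhere; any decoder-only model exposes the same interface.

```python
# Minimal sketch: autoregressive text generation with a decoder-only model (GPT-2).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Each new token is sampled conditioned on the prompt plus everything generated so far.
outputs = generator(
    "Open-source language models are useful because",
    max_new_tokens=30,
    do_sample=True,
    top_p=0.9,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```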
Comparison Of Different Open-Source LLMs
Here is an in-depth comparison of various open-source LLMs, focusing on their structures, training methods, and performance.
LLaMA
Overview
LLaMA is a family of large language models (LLMs) with a decoder-only, autoregressive architecture. It was developed by Meta AI and released in February 2023. LLaMA models range in size from 7 billion to 65 billion parameters. LLaMA 1 models were trained on a dataset of 1.4 trillion tokens, while Llama 2 models were trained on a dataset of 2 trillion tokens.
The latter was released in three model sizes: 7, 13, and 70 billion parameters. It also introduces architectural tweaks: the 70B variant uses Grouped Query Attention to make inference more efficient, while the smaller variants keep the vanilla multi-head attention mechanism. Llama 2 can also be fine-tuned with reinforcement learning to improve its performance on specific tasks. It outperforms many open-source LLMs such as MPT but still lags behind GPT-4. It is effective across a range of tasks, including machine translation and code generation, and is arguably the most capable open-source model available right now.
Figure by Meta AI
The LLaMA 1 license was for non-commercial research use, while the Llama 2 license allows commercial use with some restrictions. Meta forbids using Llama 2's outputs to train other models, and if Llama 2 is used in an app or service with more than 700 million monthly active users, a special license must be obtained from Meta AI. The architecture can be studied and modified, and the code can be used to create new models, but the weights, which contain the learned parameters, are distributed only under these license terms rather than as a fully open download. The RedPajama project reproduces and distributes an open-source version of the LLaMA training dataset.
OpenLLaMA is an open-source reproduction of the LLaMA model that provides researchers and developers with an accessible, permissively licensed large language model. It is a 7B model trained on 200 billion tokens of the RedPajama dataset, and its public availability is what makes it significant. In addition, llama.cpp, a C/C++ port of LLaMA inference, makes it practical to run these models on commodity hardware, with optional CUDA acceleration on GPUs.
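For readers who want to try this, here is a minimal sketch of loading OpenLLaMA with the Hugging Face transformers library; it assumes the openlm-research/open_llama_7b checkpoint, the accelerate package for device placement, and a GPU with enough memory, so adjust the model id and dtype to fit your hardware.

```python
# Minimal sketch: running OpenLLaMA locally with Hugging Face transformers.
# Assumes: pip install transformers accelerate sentencepiece torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory use compared with fp32
    device_map="auto",          # spreads weights across available GPUs/CPU
)

inputs = tokenizer("The RedPajama dataset is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```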
Performance Insights
Developers of LLaMA reported that the 13 billion parameter model outperformed the considerably larger GPT-3 (175 billion parameters) across numerous NLP benchmarks. The larger LLaMA models are also competitive with advanced models like PaLM and Chinchilla, showcasing LLaMA's proficiency in diverse language-related tasks.
The LLaMA 65B model has shown good capability in most use cases. It has an ARC score of 63.48, higher than that of Falcon-40B. Llama 2 also ranks among the top 10 models on the Open LLM Leaderboard on Hugging Face.
Use Cases And Applications
LLaMA is used for text generation tasks, including summarization that condenses content-rich texts while preserving vital information. It can also rewrite and improve sentences and paragraphs, with reported quality exceeding GPT-3. Its open-source nature and llama.cpp compatibility empower users to explore, customize, and deploy the model to their specific requirements.
Numerous models are constructed using LLaMA. For example, Alpaca is built on the LLaMA 7B model. Many open-source projects are continuing the work of optimizing LLaMA with the Alpaca dataset.
Llama 2 includes both foundational models and models fine-tuned for dialogue, called Llama 2-Chat, which parallel closed-source counterparts like ChatGPT and PaLM. Llama 2 and Llama 2-Chat share a context length of 4K tokens, which is shorter than the context windows offered by GPT-4.
Falcon
Overview
Falcon is an open-source large language model (LLM) developed by the Technology Innovation Institute (TII). It is available under the Apache 2.0 license, meaning you can use it commercially. It has two models, Falcon-7B and Falcon-40B, trained on 1.5 trillion and 1 trillion tokens, respectively.
Falcon was trained on a massive English web dataset called RefinedWeb, built on CommonCrawl. RefinedWeb is a high-quality dataset, deduplicated and filtered to remove machine-generated text and adult content, and models trained on it perform better than those trained on curated sets. Training Falcon required 384 GPUs on AWS over two months. Falcon was built with custom tooling: a unique data pipeline extracts high-quality content from web data, and the training codebase is independent of the works of NVIDIA, Microsoft, or Hugging Face.
Performance Insights
Falcon matches the performance of state-of-the-art models from DeepMind, Google, and Anthropic. The Falcon-40B model is currently among the top models on the Open LLM Leaderboard, and the Falcon-7B model is also among the best in its weight class. Falcon models use a multi-query attention mechanism, which shares a single key/value head across all query heads and is more memory-efficient at inference time than the vanilla multi-head attention scheme. Falcon is also efficient to train, requiring only 75 percent of GPT-3's training compute, 40 percent of Chinchilla's, and 80 percent of PaLM-62B's.
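To illustrate the idea behind multi-query attention, here is a minimal PyTorch sketch of an attention layer that keeps many query heads but shares a single key/value head. This is a conceptual toy, not Falcon's actual implementation, and it omits details such as causal masking and rotary embeddings.

```python
# Conceptual sketch of multi-query attention: many query heads, one shared key/value head.
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)              # one projection per query head
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)   # a single shared K/V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, h, t, d)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)                         # (b, t, d) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)  # broadcast the single K/V head over all query heads
        att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = att.softmax(dim=-1) @ v          # (b, h, t, d)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```

Because only one key/value head has to be cached during generation, the KV cache shrinks by roughly a factor of the number of heads, which is where the inference-time savings come from.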
Use Cases And Applications
The Falcon LLM can handle various tasks, including generating creative content, solving complex problems, customer service operations, virtual assistants, language translation, sentiment analysis, and automating repetitive work. It works in English, German, Spanish, and French, with more limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. The Falcon-40B-Instruct model is fine-tuned for most use cases, including chatbots. Falcon-40B requires ~90GB of GPU memory, while Falcon-7B only needs ~15GB, which makes Falcon-7B accessible even on consumer hardware.
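As a starting point, the sketch below loads the instruction-tuned 7B variant through the transformers text-generation pipeline; the tiiuae/falcon-7b-instruct model id is an assumption based on TII's published checkpoints, and trust_remote_code is used because Falcon originally shipped its modelling code alongside the weights.

```python
# Minimal sketch: text generation with Falcon-7B-Instruct via transformers.
# Assumes: pip install transformers accelerate torch, and roughly 15GB of GPU memory.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # Falcon ships custom modelling code with the checkpoint
    device_map="auto",
)

prompt = "Write a short, friendly reply to a customer asking about delivery times:"
print(generator(prompt, max_new_tokens=80, do_sample=True, top_k=10)[0]["generated_text"])
```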
Vicuna
Overview
Vicuna is an open-source LLM developed by LMSYS and built from LLaMA. It is fine-tuned on a dataset of 70,000 user-shared conversations from ShareGPT. Vicuna is an auto-regressive LLM available in sizes up to 33 billion parameters, and it achieves more than 90% of ChatGPT's quality in user preference tests while vastly outperforming Alpaca. Because the ShareGPT data is freely available, the training cost of Vicuna-13B is only around $300 in compute. The code, weights, and an online demo are publicly available for non-commercial use, and Vicuna can run on modest hardware using llama.cpp.
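Here is a minimal sketch of prompting a Vicuna checkpoint through transformers; the lmsys/vicuna-7b-v1.5 model id is an assumption, and note that Vicuna responds best to its own USER/ASSISTANT prompt template because it was fine-tuned on conversations.

```python
# Minimal sketch: prompting a Vicuna checkpoint with its chat-style template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "USER: Summarize why open-source LLMs matter, in two sentences. ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7)

# Strip the prompt tokens so only the model's reply is printed.
reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```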
Performance Insights
A preliminary evaluation of Vicuna-13B, using GPT-4 as a judge, showed that it achieved more than 90% of the quality of OpenAI's ChatGPT and Google Bard, while outperforming other models like LLaMA and Stanford Alpaca. In benchmark tests, Vicuna gave significantly more detailed and better-structured answers than Alpaca after fine-tuning on the ShareGPT data.
In LMSYS’s MT-Bench test, Vicuna scored 7.12, while the best proprietary model, GPT-4, secured 8.99 points. Also, in the MMLU test, it achieved 59.2 points, and GPT-4 scored 86.4 points. These results suggest that Vicuna is a promising LLM that can compete with the best proprietary models.
Use Cases And Applications
Vicuna can perform various tasks, such as generating text, translating languages, and writing creative content. It can generate text that is both coherent and informative, and it can hold conversations that are engaging and natural. However, it has known weaknesses in reasoning and math, and it can produce hallucinations. Nonetheless, Vicuna has the potential to be a valuable tool for research and development in the field of AI.
MPT
Overview
MPT (MosaicML Pretrained Transformer) is a commercially usable open-source decoder-only transformer model developed by MosaicML. It is trained on 1T tokens of text and code drawn from a variety of sources, and it is positioned against other open models such as OpenLLaMA, StableLM, and Pythia. The base model has 6.7B parameters, and long-context variants extend the context window to 8K tokens and beyond. It is available in three fine-tuned versions: MPT-7B-StoryWriter-65k+, MPT-7B-Instruct, and MPT-7B-Chat.
MPT-7B-StoryWriter-65k+ is trained on a filtered fiction subset of the books3 dataset and licensed under the Apache-2.0 license. MPT-7B-Instruct is trained on Databricks Dolly-15k and Anthropic's Helpful and Harmless datasets. MPT-7B-Chat is built on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets.
Performance Insights
Figure by MosaicML
The MPT models, initially developed to address the limitations of other open-source models, hold their own against much larger models: MosaicML reports that MPT-30B surpasses the quality of the original GPT-3, and MPT-30B-Chat scores 6.39 on LMSYS's MT-Bench test. The models leverage FlashAttention and FasterTransformer for speed. MPT-7B matches the quality of LLaMA-7B and exceeds other open-source models in the 7B-20B range. MPT is trained on long inputs of up to 65K tokens and, thanks to ALiBi, can extrapolate to 84K tokens at inference time. It is optimized for fast training and inference and ships with highly efficient open-source training code.
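The sketch below shows how the long-context StoryWriter variant is typically loaded and how ALiBi extrapolation is enabled by raising the maximum sequence length; the mosaicml/mpt-7b-storywriter model id, the max_seq_len attribute, and the reuse of the EleutherAI/gpt-neox-20b tokenizer follow MosaicML's published model card and may change between releases.

```python
# Minimal sketch: loading MPT-7B-StoryWriter and extending its context window via ALiBi.
import torch
import transformers

model_id = "mosaicml/mpt-7b-storywriter"  # assumed checkpoint name

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 83968  # ALiBi lets the model extrapolate beyond its 65K training length

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # MPT ships its own modelling code
    device_map="auto",
)
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```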
Use Cases And Applications
MPT-7B-StoryWriter-65k+ can read and write fiction with very long context lengths and can produce creative text formats like poems, code, scripts, musical pieces, emails, and letters; it can generate outputs as long as 84K tokens on a single node of A100-80GB GPUs. MPT-7B-Instruct is built for short-form instruction following and informative question answering. MPT-7B-Chat is a chatbot-like model for dialogue generation that can hold coherent conversations. If you need a more capable model that still fits on a single GPU, consider MPT-30B.
Guanaco
Overview
Guanaco is a family of non-commercial open-source LLMs fine-tuned from LLaMA using the QLoRA 4-bit fine-tuning method, which sharply reduces memory usage while preserving full 16-bit task performance. This allows Guanaco-65B to be fine-tuned on a single GPU with 48GB of VRAM in just 24 hours. Guanaco is available in four sizes: 7B, 13B, 33B, and 65B, all fine-tuned on the OASST1 dataset. Its main limitations are relatively slow 4-bit inference and weak math capabilities.
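For orientation, here is a minimal sketch of QLoRA-style 4-bit fine-tuning with the bitsandbytes and peft libraries; the base checkpoint name and the hyperparameters are illustrative rather than the exact Guanaco recipe.

```python
# Minimal sketch of QLoRA-style fine-tuning: 4-bit base model + small trainable LoRA adapters.
# Assumes: pip install transformers accelerate bitsandbytes peft torch
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model_id = "huggyllama/llama-7b"  # assumed LLaMA base checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls still run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; Guanaco adapts more modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are updated during training
```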
Performance Insights
Guanaco-65B rivals ChatGPT (the GPT-3.5 model) on the Vicuna benchmark despite its much smaller parameter count. In the MMLU test, Guanaco-33B scored 57.6 versus Falcon's 54.7, and in the MT-Bench evaluation Guanaco stands at 6.53 against Falcon's 5.17. The Guanaco-7B model requires only 5 gigabytes of GPU memory and exceeds the 26-gigabyte Alpaca model by more than 20 percentage points on the Vicuna benchmark.
Guanaco-65B achieves above 99 percent of the performance of ChatGPT (GPT-3.5-turbo) in a benchmark run with GPT-4. Guanaco-33B reached 97.8 percent of ChatGPT's performance with 33 billion parameters in a benchmark while training it on a single consumer GPU in less than 12 hours. Guanaco-65B could achieve 99.3 percent of ChatGPT's performance in 24 hours on a professional GPU.
Use Cases And Applications
Guanaco is an advanced instruction-following language model trained on English, Chinese, Japanese, and German data. It can work in multilingual environments and extend to various linguistic contexts. Its QLoRA 4-bit fine-tuning method makes Guanaco a good option for researchers and developers who need a capable LLM that can run offline or on constrained hardware. Its uses include text summarization, question answering, fine-tuning experiments, and chatbots. Since it is still in active development, it is also a good model for experimentation.
LLaMA 2 Vs Falcon
Llama 2 has three model sizes: 7B, 13B, and 70B. It was trained on 2 trillion tokens and fine-tuned with over 1 million human annotations, using Reinforcement Learning from Human Feedback (RLHF) for Llama 2-Chat. According to Meta, Llama 2 outperforms other LLMs, including Falcon and MPT, on reasoning, coding, proficiency, and knowledge tests. Llama 2-7B scored 54.32 on the Open LLM Leaderboard, while Falcon-7B scored 47.01. However, Llama 2 is less efficient than Falcon, requiring more training compute.
Falcon has two models, Falcon-7B and Falcon-40B, trained on 1.5 trillion and 1 trillion tokens, respectively. Falcon uses a multi-query attention mechanism that is more memory-efficient at inference time than the vanilla multi-head attention used by the smaller Llama 2 models. This makes Falcon more efficient to run, but it is not as strong as Llama 2 on some tasks.
Comparison Table by Hugging Face
Choosing An Open Source LLM
Here are some factors that will help you narrow down your choices and select the right open-source LLM for your needs.
Technical Requirements
Assess your hardware and infrastructure capabilities, and weigh the technical factors of an open-source LLM such as model size, compute requirements, and storage capacity; you might need to upgrade existing resources or consider cloud-based solutions. The model should also scale to meet growing real-world demand. It is equally important to consider the specific use case: a Falcon model may be a good fit for language translation and automating repetitive work, while a Guanaco model may be a better option if the business needs an LLM for both online and offline use.
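A quick back-of-the-envelope calculation helps with the hardware question: the sketch below estimates the GPU memory needed just to hold a model's weights at different precisions. Real usage is higher once activations, the KV cache, and framework overhead are included, so treat these numbers as a floor.

```python
# Rough sizing sketch: GPU memory required to hold model weights only.
def weight_memory_gb(n_params_billions: float, bytes_per_param: float) -> float:
    return n_params_billions * 1e9 * bytes_per_param / 1024**3

for name, params in [("7B", 7), ("13B", 13), ("40B", 40), ("70B", 70)]:
    fp16 = weight_memory_gb(params, 2.0)   # standard half precision
    int4 = weight_memory_gb(params, 0.5)   # 4-bit quantization (e.g. QLoRA/GPTQ-style)
    print(f"{name}: ~{fp16:.0f} GB in fp16, ~{int4:.0f} GB in 4-bit")
```

For a 7B model this works out to roughly 13 GB in fp16, which lines up with the ~15GB figure quoted for Falcon-7B once overhead is added.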
Integration Ability
Consider an open-source LLM that can integrate with your business systems, and check its compatibility with the programming languages, frameworks, and APIs used in your enterprise's ecosystem. This can save organizations time and effort and ensure the model works effectively. For example, models that ship with Hugging Face Transformers support are straightforward to adopt for enterprises already using Python with PyTorch or TensorFlow, making them easy to embed in existing software applications.
Licensing
Businesses should understand the licensing terms of the open-source LLM they select: the license type, compatibility with other licenses, permitted usage, modifications, and distribution requirements. For example, a model released under GPLv3 requires that modified versions be distributed under the same license, whereas h2oGPT, released under the Apache 2.0 license, permits modification, commercial usage, and redistribution with far fewer restrictions. If you need a commercially usable model with minimal limitations, consider the Falcon models. A careful review of the license ensures alignment, respect for intellectual property, and legal compliance.
Adaptability
The adaptability of an open-source LLM refers to its ability to perform various tasks and how accurate its output is for those tasks. This adaptability depends on the model's design, training data, documentation, and community. For example, Vicuna is strong at generating coherent creative content but lags in math. LLaMA, on the other hand, performs well on common-sense inference but is only average on truthfulness when answering questions. MPT, with its long-context variants, is a good model for long text generation.
The Advantages Of Open Source LLMs
Open-source LLMs offer several advantages over closed-source LLMs, including cost-effectiveness, flexibility, security, and reduced dependency on external providers.
Customization
Open-source LLMs are a flexible alternative to proprietary LLMs. They are freely available to use and modify with no recurring license costs. Organizations can deploy them on-premises and customize the models to their needs, adding features that are unavailable in the default version. Open-source LLMs also have large and active communities of users and developers, which helps with troubleshooting and further development.
Cost Efficiency
Open-source LLMs offer a cost-efficient alternative to proprietary software, saving costs on licenses, subscriptions, and hidden charges. Hosted LLMs have usage fees that add up quickly; open-source LLMs have no such recurring payments, so once downloaded, organizations can deploy them without license expenses (though compute costs still apply). This makes them attractive for startups, small businesses, and organizations with limited budgets. They also allow customization and domain-specific training, which is time-consuming and costly with proprietary options, an advantage for organizations looking for affordable, tailored solutions.
Security
Open-source LLMs help secure your data by enabling on-premise deployment, minimizing exposure to external servers and reducing the risk of data breaches and unauthorized access. Because the code is open to public inspection, loopholes can be identified, which improves security and discourages malicious use of the models. Independent security experts can audit and harden open-source LLMs, reducing the likelihood of vulnerabilities and helping ensure the models are fit for the tasks at hand.
Reduced Dependency
Open-source LLMs can help organizations reduce their dependency on external providers, because the models can be deployed on-premises, supported by a community of developers, and sourced from many providers. Running the model within the organization's own setup lets developers make it work with their current systems, apps, and data, giving organizations more control over the models and their performance, more flexibility in pricing and features, and a way to avoid vendor lock-in.
Want To Select Custom LLMs For Your Business?
If you are looking to select or build a custom LLM for your business, we can help. We are a team of AI engineers with extensive experience in training and fine-tuning custom LLMs. Contact us today and let us select a custom LLM to elevate your business.