Llama 2 Model Parameters Comparison
In the paper titled "Llama 2: Open Foundation and Fine-Tuned Chat Models," there is a detailed discussion about architecture changes compared to the first version of Llama. Here are some key points relevant to the number of parameters in attention modules and MLPs:
Memory and Parameter Optimization in Attention Modules: The paper discusses the memory cost of the key-value (KV) cache in multi-head attention (MHA) models. To address this, it suggests sharing key and value projections across multiple heads to reduce memory usage: either the original multi-query format with a single KV projection (MQA) or a grouped-query attention (GQA) variant with 8 KV projections can be used (a KV-cache size sketch follows these points).
Parameter Adjustments in Feed-Forward Networks (FFNs): To maintain a similar overall parameter count across variants, the paper increases the dimension of the feed-forward layers to compensate for the reduction in the attention layers. For the multi-query attention (MQA) variant, the FFN dimension is increased by a factor of 1.33, and for the grouped-query attention (GQA) variant, it is increased by a factor of 1.3.
Performance Comparisons: The paper provides performance comparisons for different attention architecture variants, including MHA, MQA, and GQA, across various tasks. It also discusses how these architecture choices impact model latency and memory usage, especially when hosted on GPUs.
Decision for Llama 2 Models: Based on the ablation results and the ease of scaling inference, the authors chose GQA over MQA for the 34B and 70B Llama 2 models.
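To make the memory argument concrete, here is a minimal Python sketch of the KV-cache size under MHA, GQA, and MQA. The model shape (80 layers, 64 heads of dimension 128, roughly 70B-scale) and the helper name kv_cache_bytes are illustrative assumptions, not values quoted in the paper; the only point is that the cache shrinks in proportion to the number of KV heads.

```python
# Back-of-the-envelope KV-cache size for MHA, GQA, and MQA.
# The dimensions below are illustrative assumptions (roughly 70B-scale),
# not values quoted in the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Key + value cache: 2 tensors per layer of shape
    [batch_size, n_kv_heads, seq_len, head_dim], stored in fp16/bf16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

n_layers, n_heads, head_dim = 80, 64, 128   # assumed model shape
seq_len, batch = 2048, 64

for name, n_kv in [("MHA (64 KV heads)", 64), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    gib = kv_cache_bytes(n_layers, n_kv, head_dim, seq_len, batch) / 2**30
    print(f"{name:18s} -> {gib:6.1f} GiB of KV cache")
```

With these assumed dimensions the MHA cache is 8x larger than the GQA cache and 64x larger than the MQA cache, which is the memory pressure that sharing KV projections is meant to relieve.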
While the paper provides insights into the architectural choices and optimizations made for attention modules and MLPs, it doesn't provide a direct comparison of the exact number of parameters in attention modules versus MLPs in Llama 2. The focus is more on the trade-offs between memory usage, performance, and parameter adjustments across different variants of attention modules and how these changes impact the overall architecture of the Llama 2 models.
In the "Llama 2: Open Foundation and Fine-Tuned Chat Models" paper, section 8.2.1, titled "Architecture Changes Compared to Llama 1," the authors discuss various changes in the architecture, focusing on optimizing the model's performance. Here are the key points related to the attention modules and MLPs:
Grouped-Query Attention (GQA): This attention mechanism is highlighted as a solution to address the increasing memory costs associated with key-value (KV) cache sizes in multi-head attention (MHA) models. GQA allows for sharing key and value projections across multiple heads, thereby reducing memory usage.
Parameter Adjustments: To maintain a similar overall parameter count while optimizing attention layers, the dimension of the feed-forward layers (FFNs) is increased. For the multi-query attention (MQA) variant, the FFN dimension is increased by a factor of 1.33, and for the GQA variant, it's increased by a factor of 1.3.
Performance Comparison: The paper includes performance comparisons of different attention architectures, namely MHA, MQA, and GQA. These comparisons are based on various evaluation tasks and show that the GQA variant performs comparably to the MHA baseline and better than the MQA variant on average.
Inference Speed and Memory Usage: There's also a discussion on how different attention variants affect inference speed and memory usage, particularly in the context of large models and high batch sizes.
The paper provides detailed insights into how architectural changes, particularly in the attention mechanisms and feed-forward networks, are implemented in the Llama 2 model. It emphasizes the balance between maintaining parameter counts and optimizing for performance and memory usage. However, it does not provide a direct numerical comparison of the number of parameters in attention modules versus MLPs.
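Although the paper does not give that breakdown, a rough per-layer count can be sketched from the publicly released Llama 2 70B configuration (hidden size 8192, 64 query heads, 8 KV heads, SwiGLU FFN with intermediate size 28672). These values come from the released checkpoints, not from the appendix under discussion, and biases and normalization parameters are ignored.

```python
# Rough per-layer parameter counts for a Llama-2-70B-style decoder layer.
# Config values are assumed from the released 70B checkpoint: d_model=8192,
# 64 query heads, 8 KV heads (GQA), SwiGLU FFN with intermediate size 28672.

d_model, n_heads, n_kv_heads, d_ffn = 8192, 64, 8, 28672
head_dim = d_model // n_heads  # 128

# Attention: Q and O are d_model x d_model; K and V are d_model x (n_kv_heads * head_dim).
attn_params = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim

# SwiGLU FFN: gate, up, and down projections, each d_model x d_ffn.
ffn_params = 3 * d_model * d_ffn

print(f"attention per layer: {attn_params / 1e6:6.1f} M")       # ~151.0 M
print(f"FFN per layer:       {ffn_params / 1e6:6.1f} M")        # ~704.6 M
print(f"FFN / attention:     {ffn_params / attn_params:.1f}x")  # ~4.7x
```

Under these assumptions the feed-forward block holds roughly 4-5x more parameters per layer than the attention block, which is consistent with the paper's choice of the FFN width as the knob for rebalancing the total parameter count.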
Additionally, the paper includes a figure illustrating the performance of these variants in different scenarios; its caption reads:
Multi-query variants enable higher throughput with larger batch sizes, and show similar latency on smaller batches. The first data point corresponds to batch size 1, and then we double it until the model runs out of memory. The MHA variant triggers an out-of-memory error at a batch size of 1024 for a context of 256 tokens and at a batch size of 128 for 2k context, whereas MQA and GQA have successful runs in those settings.
The "Llama 2: Open Foundation and Fine-Tuned Chat Models" paper explains that for the multi-query attention (MQA) variant, the dimension of the feed-forward network (FFN) layers is increased by a factor of 1.33. This adjustment is made to keep a similar overall parameter count across MQA and Grouped-Query Attention (GQA) variants, compensating for the reduction in the attention layers.
However, the paper does not explicitly state how much the MQA variant reduces the parameter count in the attention modules. The focus is more on maintaining a balance between the parameters in the attention modules and the FFNs. The increase in the dimension of FFNs by a factor of 1.33 suggests that there is a reduction in parameters in the attention layers, but the exact amount of this reduction is not specified.
The approach described indicates a deliberate trade-off between attention and FFN parameters: MQA reduces the parameter count in the attention modules, and the wider FFN compensates, but the exact numerical reduction is not detailed in the paper.
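As a rough estimate under stated assumptions (square query and output projections, head_dim = d_model / n_heads, no biases), the size of that reduction can be approximated as follows; the 70B-like shape is illustrative and none of these figures appear in the paper.

```python
# Approximate attention-projection parameters per layer for MHA, GQA, and MQA,
# assuming square Q/O projections and head_dim = d_model / n_heads.
# The shape below is an illustrative assumption, not a figure from the paper.

def attn_params(d_model, n_heads, n_kv_heads):
    head_dim = d_model // n_heads
    q_and_o = 2 * d_model * d_model                # query and output projections
    k_and_v = 2 * d_model * n_kv_heads * head_dim  # shared key/value projections
    return q_and_o + k_and_v

d_model, n_heads = 8192, 64   # illustrative 70B-like shape (assumed)

mha = attn_params(d_model, n_heads, n_heads)
for name, n_kv in [("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    p = attn_params(d_model, n_heads, n_kv)
    print(f"{name:17s}: {p / 1e6:6.1f} M params/layer "
          f"({100 * (mha - p) / mha:.0f}% fewer than MHA)")
```

In this approximation MQA removes just under half of the attention-projection parameters per layer, and GQA with 8 KV heads slightly less, which is the gap the paper fills by widening the FFN by factors of 1.33 and 1.3 respectively.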