Nagesh Singh Chauhan
- Mar 13
- 7 min read

Exploring 1-Bit LLMs by Microsoft

Delving into the Core of Microsoft's Revolutionary 1-Bit LLM Technology.

Introduction

In recent years, Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP), exhibiting remarkable capabilities in a wide array of tasks such as language generation, translation, and sentiment analysis. These models, often trained on vast amounts of textual data, have revolutionized the way we interact with and process natural language. However, the increasing size and complexity of LLMs have brought forth a host of challenges that warrant careful consideration.

One of the primary concerns associated with LLMs is their burgeoning size. As models scale up to accommodate larger datasets and capture more nuanced linguistic patterns, the number of parameters and layers grows sharply, resulting in colossal model sizes. This rapid growth poses substantial challenges in terms of memory consumption, computational resources, and storage requirements. Storing and computing with every parameter in 32-bit or even 16-bit floating point compounds the issue, since each value occupies several bytes and every operation on it requires floating-point arithmetic.

What are 1-bit LLMs?

Large Language Models (LLMs) have taken the world by storm, but their sheer size often poses a challenge. Traditionally, these models (like most ML models, down to a simple logistic regression) store their weights (parameters) as 32-bit or 16-bit floating-point numbers, leading to massive memory footprints. This makes running them on devices like mobile phones a logistical nightmare.

Here's where 1-bit LLMs come in as a game-changer.

The Root of the Problem:

Conventional LLMs, like GPT or others, boast billions of parameters.
Each parameter is typically stored using a 32-bit or 16-bit floating-point number, requiring several bytes of memory.
This high precision, while offering advantages, leads to a significant increase in model size.

The Case of Mistral or Llama-7B

Imagine "Llama-7B," an LLM with 7 billion parameters, using 32-bit precision for each. This translates to:

Total Memory: Size of one weight * Number of weights
Total Memory: 4 bytes/weight * 7,000,000,000 weights
Total Memory: 28,000,000,000 bytes
Total Memory (GB): 28,000,000,000 bytes / 1,073,741,824 bytes/GB ≈ 26.08 GB
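The arithmetic above is easy to reproduce. The snippet below is a minimal sketch, not part of the original write-up: the helper name weight_memory_bytes is made up for illustration, and it counts only the weights themselves (no activations, optimizer state, or KV cache).

```python
GIB = 1024 ** 3  # bytes per GB as used above (1,073,741,824)


def weight_memory_bytes(num_weights: int, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return num_weights * bits_per_weight / 8


if __name__ == "__main__":
    n = 7_000_000_000  # the 7 billion weights from the example above
    for label, bits in [("FP32", 32), ("FP16", 16)]:
        size = weight_memory_bytes(n, bits)
        print(f"{label}: {size:,.0f} bytes = {size / GIB:.2f} GB")
```

Passing a different bits_per_weight value lets you compare other precisions with the same helper.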

This massive size excludes numerous devices, especially mobile phones, due to their limited storage and processing capabilities.

The 1-Bit LLM Solution:

1-bit LLMs offer a revolutionary approach by storing weights using only 1 bit (0 or 1) instead of multiple bytes. This dramatic reduction in storage requirements brings several benefits:

Smaller Model Size: By eliminating the need for multiple bytes per parameter, 1-bit LLMs achieve a significantly smaller footprint compared to traditional models.
Mobile-Friendly AI: The reduced size makes these models ideal for deployment on mobile devices, enabling on-device language processing and AI capabilities.
Improved Accessibility: This technology opens doors for a wider range of devices to leverage the power of LLMs, democratizing access to advanced AI features.

Unlocking the Potential:

1-bit LLMs represent a significant step forward in making LLMs more accessible and efficient. With continued research and development, we can expect even further advancements in this field. This paves the way for a future where powerful language processing is available not just on high-end computers but also in the palm of your hand.

1-bit LLMs (e.g., BitNet b1.58) provide a Pareto solution to reduce inference cost (latency, throughput, and energy) of LLMs while maintaining model performance. The new computation paradigm of BitNet b1.58 calls for actions to design new hardware optimized for 1-bit LLMs.

A "Pareto improvement" is a change that makes at least one part of a system better without making any other part worse. In this case, inference efficiency improves while model accuracy is maintained.

Addressing the challenge posed by the substantial size of Large Language Models (LLMs) is crucial for their integration into local systems and production environments. Typically, the weights of machine learning models, whether LLMs or logistic regression, are stored as 32-bit or 16-bit floating-point numbers, contributing to large model sizes.

The root cause lies in the high precision values assigned to these weights, making models like GPT and other larger counterparts impractical for deployment on devices with limited storage and hardware capacity.

The total memory occupied by a 1-bit model with the same 7 billion weights, where each weight takes 1 bit (0.125 bytes), is calculated as follows:

Total Memory = Size of one weight * Number of weights

Total Memory = 0.125 bytes * 7,000,000,000

Total Memory = 875,000,000 bytes

Converting this to gigabytes (GB), we get:

Total Memory = 875,000,000 bytes / 1024³ bytes per GB

Total Memory ≈ 0.815 GB

That is roughly 32 times smaller than the 32-bit version of the same model, which is what makes running LLMs on smaller devices and mobile phones realistic.
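The 0.125 bytes per weight figure comes from packing eight binary weights into a single byte. Here is a small sketch of that idea using NumPy; the function names and the choice of storing only the sign of each weight are illustrative assumptions, not a description of any particular model's storage format.

```python
import numpy as np


def pack_binary(weights: np.ndarray) -> np.ndarray:
    """Store weights from {-1, +1} as single bits, eight per byte."""
    bits = (weights > 0).astype(np.uint8)  # +1 -> 1, -1 -> 0
    return np.packbits(bits)


def unpack_binary(packed: np.ndarray, count: int) -> np.ndarray:
    """Recover the original {-1, +1} weights from the packed bytes."""
    bits = np.unpackbits(packed, count=count)
    return bits.astype(np.int8) * 2 - 1  # 1 -> +1, 0 -> -1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical binary weights, purely for demonstration.
    weights = rng.choice(np.array([-1, 1], dtype=np.int8), size=64)
    packed = pack_binary(weights)
    print(f"{weights.size} weights stored in {packed.nbytes} bytes")
    print("round trip ok:", np.array_equal(unpack_binary(packed, weights.size), weights))
```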

Quantization: The main idea behind 1-bit

Quantization involves reducing the precision of numerical values representing model parameters (weights) and activations (outputs of layers). Typically, this process converts values from high-precision formats, such as 32-bit floating-point (FP32), to lower-precision formats like 8-bit integers (INT8) or even binary and ternary formats. The primary objective of quantization is to diminish the model's memory footprint, accelerate inference and training times, and reduce energy consumption, all while minimizing the impact on accuracy.

Quantization shrinks neural networks by decreasing the precision of weights, biases, and activations.

To illustrate the concept of quantization, consider a neural network node with a weight of 8.6256 (without quantization). Storing all of those decimal places takes memory, and multiplying and adding such values requires floating-point arithmetic. Rounding the value to 9 through quantization saves both space and processing power, and when done carefully it does not substantially compromise the model's performance. This is how quantization makes neural network operations more efficient across a range of applications.
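To make this concrete, here is a minimal sketch of one common scheme, symmetric "absmax" quantization to 8-bit integers. It is a generic illustration rather than the method of any specific model; the function names are invented for this example, and it assumes the input contains at least one non-zero value.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = np.max(np.abs(x)) / 127.0  # assumes x is not all zeros
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate the original floats from the stored integers."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    # Illustrative weights, including the 8.6256 from the example above.
    weights = np.array([8.6256, -3.1, 0.02, 5.5], dtype=np.float32)
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print("int8 values:", q)
    print("max round-trip error:", np.abs(weights - restored).max())
```

Each value now takes 1 byte instead of 4, and the round-trip error stays within half a quantization step.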

BitNet b1.58

BitNet b1.58, the first model of its kind, is not strictly a 1-bit LLM: each weight takes one of three possible values (-1, 0, or +1), which carries log2(3) ≈ 1.58 bits of information, hence the name.

BitNet Architecture.
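Where does 1.58 come from? A value with three possible states carries log2(3) bits of information. The sketch below computes that number and shows one way to get close to it in practice: five ternary values have 3^5 = 243 combinations, which fit in one byte (1.6 bits per weight). The packing scheme here is only an illustration and is not claimed to be how BitNet b1.58 actually stores its weights.

```python
import math

BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585


def pack_trits(trits):
    """Pack five values from {-1, 0, 1} into one integer in [0, 242]."""
    if len(trits) != 5 or any(t not in (-1, 0, 1) for t in trits):
        raise ValueError("expected exactly five values from {-1, 0, 1}")
    code = 0
    for t in trits:
        code = code * 3 + (t + 1)  # shift to {0, 1, 2} and append a base-3 digit
    return code


def unpack_trits(code):
    """Inverse of pack_trits."""
    digits = []
    for _ in range(5):
        code, d = divmod(code, 3)
        digits.append(d - 1)
    return digits[::-1]


if __name__ == "__main__":
    print(f"log2(3) = {BITS_PER_TERNARY_WEIGHT:.3f} bits")
    code = pack_trits([1, -1, 0, 0, 1])
    print("packed byte:", code, "-> unpacked:", unpack_trits(code))
```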

BitNet adopts the same layout as a Transformer, stacking blocks of self-attention and feed-forward networks. Where a conventional Transformer uses standard matrix multiplication in its linear layers, BitNet uses BitLinear, which operates on low-bit weights: binarized (1-bit) in the original BitNet and ternary in BitNet b1.58. Other components, such as residual connections and layer normalization, are kept at higher precision (8-bit in the experimental setups). Several reasons support this choice.

Firstly, residual connections and layer normalization account for a negligible share of the computation cost of large language models. Secondly, the computation cost of the QKV transformation becomes much smaller than that of the parametric projection as the model scales in size. Thirdly, preserving precision for the input/output embeddings is essential because language models need high-precision probabilities for effective sampling.
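One reason low-bit weights are attractive in the linear layers is that multiplying by -1, 0, or +1 needs no real multiplication at all. The following is a rough sketch of that idea for a matrix-vector product with ternary weights. It is not the BitLinear implementation (which also handles scaling and activation quantization); the function name is made up for illustration.

```python
import numpy as np


def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute W @ x for W in {-1, 0, +1} using only additions and subtractions."""
    y = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        # Add inputs where the weight is +1, subtract where it is -1, skip zeros.
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y


if __name__ == "__main__":
    W = np.array([[1, 0, -1], [0, 0, 0], [-1, -1, 1]], dtype=np.int8)
    x = np.array([0.5, 2.0, -3.0])
    print("add/subtract result:", ternary_matvec(W, x))
    print("matches W @ x:", np.allclose(ternary_matvec(W, x), W @ x))
```

The Python loop is slow and only meant to show the arithmetic; the actual gains depend on kernels and hardware built for this pattern.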

Quantization Function: To constrain the weights to -1, 0, or +1, they adopt an absmean quantization function. It first scales the weight matrix by its average absolute value, and then rounds each value to the nearest integer among {-1, 0, +1}.

Here’s a simplified breakdown of the above function:

Scaling: The function first divides the entire weight matrix by its average absolute value. This normalizes the magnitudes so that a typical weight lands near -1 or +1, regardless of the original scale of the matrix.
Rounding: Each scaled value is then rounded to the nearest of -1, 0, and +1, with anything beyond that range clipped to -1 or +1. This maps the scaled weights onto the discrete ternary set.
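The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration of absmean quantization, not the reference implementation: the function name and the eps value (a small constant to avoid division by zero) are assumptions made for this example.

```python
import numpy as np


def absmean_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with absmean scaling."""
    gamma = np.mean(np.abs(w))             # average absolute value
    scaled = w / (gamma + eps)             # step 1: scaling
    w_q = np.clip(np.round(scaled), -1, 1)  # step 2: round, then clip
    return w_q.astype(np.int8), gamma


if __name__ == "__main__":
    # Illustrative weights only.
    w = np.array([[0.9, -0.05, -1.2],
                  [0.3,  0.0,   2.4]])
    w_q, gamma = absmean_quantize(w)
    print("gamma:", gamma)
    print("ternary weights:")
    print(w_q)
```

Keeping gamma around lets the layer rescale its output afterwards, so the ternary matrix only has to capture the sign and rough presence of each weight.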

The LLaMA architecture has been the de facto backbone for open-source LLMs. To embrace the open-source community, the design of BitNet b1.58 adopts LLaMA-alike components. Specifically, it uses RMSNorm, SwiGLU, and rotary embedding, and removes all biases. In this way, BitNet b1.58 can be integrated into popular open-source software (e.g., Huggingface, vLLM [KLZ+23], and llama.cpp) with minimal effort.

Decoding latency (Left) and memory consumption (Right) of BitNet b1.58 varying the model size.

The performance gap between BitNet b1.58 and LLaMA LLM narrows as model size increases. Importantly, BitNet b1.58 matches the performance of the full-precision baseline starting from the 3B size. BitNet b1.58 3.9B outperforms LLaMA LLM 3B at lower memory and latency cost, which makes it a Pareto improvement over the full-precision model at that scale.

Significance of BitNet b1.58

This advancement holds groundbreaking significance for various reasons:

Cost and Energy Efficiency: BitNet b1.58 achieves a paradigm shift by reducing the precision of weights to 1.58 bits. This substantial reduction leads to a drastic decrease in the energy and computational costs associated with operating Large Language Models (LLMs), establishing BitNet b1.58 as a more sustainable and efficient option.
Model Performance: Despite the reduction in bit representation, BitNet b1.58 matches, and in the reported comparisons sometimes surpasses, the performance of full-precision LLMs in terms of perplexity and task-specific metrics, starting from a model size of 3 billion parameters.
Scalability and Future Applications: BitNet b1.58 showcases outstanding scalability, paving the way for future applications. Its diminished computational requirements enable the deployment of more sophisticated AI models on edge and mobile devices, expanding the realm of possibilities for AI in various domains.

Potential Future of Large Language Models

The BitNet b1.58 model showcases significant advancements in cost and energy efficiency, performing on par with traditional transformer models and unveiling vast potential. Here are a few noteworthy possibilities:

1-bit Mixture-of-Experts (MoE) LLMs: Mixture-of-Experts (MoE) is a cost-effective approach for LLMs, but its drawbacks, such as high memory consumption and inter-chip communication overhead, limit its application. The introduction of 1.58-bit LLMs addresses these issues by reducing the memory footprint, potentially allowing the entire MoE model to reside on a single chip. This could eliminate inter-chip communication overhead, streamlining the deployment of powerful MoE models.
Long Sequence Processing in LLMs: Current LLMs use considerable memory when processing long text because key-value (KV) caches store intermediate computations for every token. BitNet b1.58 reduces the activation format from the conventional 16 bits to 8 bits. This halves the memory required to store activations, so the model can handle sequences twice as long with the same memory. Further optimization through lossless compression could bring activations down to 4 bits or less without losing information.
LLMs on Edge and Mobile: Edge and mobile devices are constrained by memory and processing power and rely mainly on CPUs rather than GPUs. 1.58-bit LLMs are well suited to these less powerful CPUs, opening avenues for new applications. These devices can locally execute tasks like conversations or translations with these models, expanding the possibilities for edge and mobile computing.
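To get a feel for the long-sequence point, the sketch below estimates KV cache size at 16-bit and 8-bit precision. The formula is the usual back-of-the-envelope one (keys plus values, per layer, per head, per token), and the model dimensions are hypothetical values chosen for illustration, not figures taken from BitNet b1.58.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits, batch=1):
    """Rough KV cache size: keys and values for every layer, head, and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elements * bits // 8


if __name__ == "__main__":
    # Hypothetical 7B-class configuration, for illustration only.
    cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
    for bits in (16, 8):
        size = kv_cache_bytes(seq_len=4096, bits=bits, **cfg)
        print(f"{bits}-bit cache at 4096 tokens: {size / 1024**3:.2f} GB")
    # With half the bits, twice the tokens fit in the same memory.
    same = kv_cache_bytes(seq_len=8192, bits=8, **cfg) == kv_cache_bytes(seq_len=4096, bits=16, **cfg)
    print("8-bit at 8192 tokens equals 16-bit at 4096 tokens:", same)
```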

Conclusion

In the realm of technological breakthroughs, Microsoft's 1-Bit LLM technology stands out as a revolutionary achievement. This pioneering advancement allows the storage of a single parameter using a mere 1.58 bits, departing from the conventional 16-bit and 32-bit floating-point formats. It is expected to improve inference speed, memory use, and energy consumption, which holds promise for empowering edge AI applications. The continuous progression of technology fuels our excitement, and we eagerly look forward to witnessing further strides in this transformative journey.