4 Bit != 4 Bit

Anyone running AI models locally on consumer hardware can hardly avoid quantization. The reason is pragmatic: large models consist of billions of weights that must be stored and processed. Storing them with fewer bits reduces memory usage and often lowers the barrier to practical inference as well. That is why quantization is not the exception but the standard in the GGUF ecosystem around llama.cpp and in Apple’s MLX world.

What matters here is that quantization is not just about a bit count. GGUF packages models as complete binary files containing weights, metadata, and tensor descriptions, and a single file can include different quantization types. MLX also clearly documents that parameters such as group size and quantization mode change the format and therefore affect memory and compute behavior. The key insight is this: it is not only the bit width that matters, but also how and by whom the model was quantized. Individual tensors can intentionally be kept at higher precision, which improves quality but costs speed.
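As a rough illustration of why group size matters, the effective storage cost per weight can be estimated. The sketch below assumes a group-wise affine scheme with one fp16 scale and one fp16 bias per group, which is close to MLX's default layout; the exact format differs between quantization types, so treat the numbers as an approximation.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Estimate storage cost per weight for group-wise affine quantization.

    Each group of `group_size` weights shares one scale and one bias
    (assumed fp16 here), so the per-group overhead is amortized
    across the weights in that group.
    """
    overhead = (scale_bits + bias_bits) / group_size
    return bits + overhead

# Smaller groups track the weight distribution more closely (better
# quality) but pay more overhead per weight:
print(effective_bits_per_weight(4, 64))  # → 4.5 bits per weight
print(effective_bits_per_weight(4, 32))  # → 5.0 bits per weight
```

This is one reason two "4-bit" models from different providers can differ in both size and speed: the bit count names only the quantized weights, not the per-group metadata or the tensors kept at higher precision.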

This became especially clear to me in practical tests with Gemma 4 26B A4B on a Mac with an M4 Pro. I tested GGUF models in llama.cpp as well as MLX models in mlx-vlm, each in 4-bit variants from different providers. The measured results were:

GGUF with llama.cpp

  • Unsloth GGUF Q4: 47 tokens/s
  • LM Studio GGUF Q4: 56 tokens/s

MLX with mlx-vlm

  • Unsloth MLX 4-bit: 56 tokens/s
  • MLX Community 4-bit: 64 tokens/s

These differences are clearly noticeable in everyday use and go beyond mere measurement noise. That is not a criticism of slower providers. Anyone optimizing more heavily for quality, accuracy, and evaluation may deliberately make choices that require a bit more computation.

The most important takeaway is this: performance is determined not only by the model and the hardware, but also by the provider of the quantization. This leads to a practical selection process for on-device AI. Instead of simply loading “a model in Q4,” it is worth doing a structured comparison. Small, reproducible mini-benchmarks with the same prompt, fixed parameters, and multiple runs are sufficient. The goal is not an academic comparison, but a reliable decision based on your own usage pattern.
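A minimal harness for such mini-benchmarks only needs to time repeated generations with a fixed prompt and report the mean and spread. The sketch below is backend-agnostic: `generate` stands in for whatever call drives llama.cpp or mlx-vlm in your setup, and a fake generator is used here so the harness runs on its own.

```python
import statistics
import time

def benchmark(generate, prompt: str, runs: int = 5) -> tuple[float, float]:
    """Time `generate(prompt) -> number of tokens produced` over several
    runs and return (mean, stdev) in tokens per second."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.mean(rates), statistics.stdev(rates)

# Stand-in generator so the harness is self-contained; replace this with
# a real call into llama.cpp or mlx-vlm.
def fake_generate(prompt: str) -> int:
    time.sleep(0.01)  # simulate generation latency
    return 64         # pretend 64 tokens were produced

mean_tps, stdev_tps = benchmark(fake_generate, "Explain quantization.", runs=3)
print(f"{mean_tps:.1f} ± {stdev_tps:.1f} tokens/s")
```

Running each candidate quantization through the same harness with the same prompt and parameters is enough to separate real provider differences from run-to-run noise.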

This does make model selection a bit more complicated, but the practical effect is greater than one might initially think. A difference of a few tokens per second adds up noticeably in daily use, not only in terms of time, but also in how the work feels. A smooth model supports creativity and productivity; a sluggish one holds them back.
