🎉 llama.cpp Support Now Available!

#16 · by nologik

I'm excited to announce that IQuest-Loop-Instruct models are now fully supported in llama.cpp! 🚀

This is the world's first implementation of loop attention in the GGUF ecosystem.

What's New:

✅ Full loop attention support - Dual attention with learned per-head gates
✅ GGUF conversion - Convert PyTorch models to GGUF format (sketch after this list)
✅ Quantization support - Q4_K_M, Q5_K_M, Q8_0 quantization available
✅ Production ready - Tested and working with text generation
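
Conversion and quantization go through the standard llama.cpp tooling. A minimal sketch, assuming the scripts from the PR branch linked below and illustrative file paths:

# Convert the PyTorch checkpoint to an F16 GGUF
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    --outtype f16

# Quantize the F16 GGUF down to Q4_K_M
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M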

Quick Start:

# Run inference
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "Write a function to reverse a linked list" \
    --n-predict 256
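
If you'd rather serve it over HTTP, the bundled llama-server takes the same model file. The flags and the /completion endpoint below are stock llama.cpp, not anything specific to the loop model:

# Serve the model over HTTP
./llama-server --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --port 8080

# Query the completion endpoint
curl http://localhost:8080/completion -d '{
  "prompt": "Write a function to reverse a linked list",
  "n_predict": 256
}'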

GGUF Models Available:

Pre-converted GGUF models: https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF

Sizes:

  • F16: 75GB
  • Q8_0: 40GB
  • Q5_K_M: 27GB
  • Q4_K_M: 23GB
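
To pull a single quantization from the repo above, the standard Hugging Face CLI works; the --include pattern below assumes file names matching the one shown in Quick Start:

# Download just the Q4_K_M file
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF \
    --include "*q4_k_m*" --local-dir ./models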

Technical Details:

The implementation includes:

  • Loop iteration wrapper (loop_num=2)
  • Global K/V caching from Loop 0
  • Dual attention (local + global) with gate mixing (sketched after this list)
  • Full backward compatibility with standard llama models
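
Concretely, the gate mixing blends each head's local attention with attention over the K/V cached from Loop 0. A simplified sketch; the exact gate parameterization (here a sigmoid over a learned per-head scalar) may differ in the actual code:

$$
\text{out}_h = g_h \cdot \mathrm{Attn}\!\left(Q_h, K_h^{\mathrm{loc}}, V_h^{\mathrm{loc}}\right) + (1 - g_h) \cdot \mathrm{Attn}\!\left(Q_h, K_h^{\mathrm{glob}}, V_h^{\mathrm{glob}}\right), \qquad g_h = \sigma(w_h)
$$

where $K^{\mathrm{glob}}, V^{\mathrm{glob}}$ are the keys/values cached during Loop 0 and $\mathrm{Attn}$ is ordinary scaled dot-product attention.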

PR to llama.cpp: https://github.com/ggml-org/llama.cpp/pull/18680

Performance:

Tested on IQuest-Coder-V1-40B-Loop-Instruct:

  • Prompt processing: ~3.4 t/s
  • Text generation: ~0.8 t/s
  • Memory overhead: ~512MB for global K/V cache
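
For anyone sizing this themselves, the global cache follows the usual K/V arithmetic. This is the generic formula, not a derivation of the ~512MB figure from this model's exact dimensions:

$$
\text{bytes} = 2 \times n_{\text{layers}} \times n_{\text{ctx}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per element}
$$

with the factor 2 counting the separate K and V tensors.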

Big thanks to the llama.cpp community and @ggerganov for the amazing ecosystem! 🙏


Related:

https://github.com/ggml-org/llama.cpp/pull/18680

Rejected as AI-generated slop violating their contributor guidelines.

Did you check the commit qpqpqpqpqpqp? Did you check the author (hint: not exactly a noob vibecoder)? Do you know Claude's capabilities? Sure, the main project's maintainers rejected the code under their contributor guidelines, but nobody there called it "slop", since the code actually runs the loop model. I just cloned the feature branch and I'm now building llama.cpp myself so I can test the loop model with Ollama; rough steps below.
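
Assuming the default CMake flow (the local branch name loop-attention is just my label):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Fetch PR #18680 into a local branch and switch to it
git fetch origin pull/18680/head:loop-attention
git checkout loop-attention
# Standard CMake release build
cmake -B build
cmake --build build --config Release -j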

People complained about compilers in the '60s because they would generate slop: code that might not do exactly what you would have written in assembly, plus inferences and optimizations that could lead to unintended behavior. Over the years compilers got better, and I'm sure we can all agree on that.

In general, we have the input language, the translator, and the output language. From highest to lowest level, we have:

Input Language -> Translator -> Output Language

English -> AI -> Code (C, C++)
Code -> Compiler -> Machine Code
Machine Code -> Hardware Circuits -> Electrical Signals

We jumped up a level of abstraction with AI. At the end of the day, we're working with language. Embrace it. There's no reason to hold AI in contempt. The next question is: what is the next level of abstraction above English?
