🎉 llama.cpp Support Now Available!

#16 · by nologik

I'm excited to announce that IQuest-Loop-Instruct models are now fully supported in llama.cpp! 🚀

This is the world's first implementation of loop attention in the GGUF ecosystem.

What's New:

✅ Full loop attention support - Dual attention with learned per-head gates
✅ GGUF conversion - Convert PyTorch models to GGUF format (sketch after this list)
✅ Quantization support - Q4_K_M, Q5_K_M, Q8_0 quantization available
✅ Production ready - Tested and working with text generation
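
Conversion and quantization go through the standard llama.cpp tooling. A minimal sketch, assuming the scripts from the PR branch linked below and illustrative file paths:

# Convert the PyTorch checkpoint to an F16 GGUF
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    --outtype f16

# Quantize the F16 GGUF down to Q4_K_M
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M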

Quick Start:

# Run inference
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "Write a function to reverse a linked list" \
    --n-predict 256
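
If you'd rather serve it over HTTP, the bundled llama-server takes the same model file. The flags and the /completion endpoint below are stock llama.cpp, not anything specific to the loop model:

# Serve the model over HTTP
./llama-server --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --port 8080

# Query the completion endpoint
curl http://localhost:8080/completion -d '{
  "prompt": "Write a function to reverse a linked list",
  "n_predict": 256
}'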

GGUF Models Available:

Pre-converted GGUF models: https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF

Sizes:

  • F16: 75GB
  • Q8_0: 40GB
  • Q5_K_M: 27GB
  • Q4_K_M: 23GB
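
To pull a single quantization from the repo above, the standard Hugging Face CLI works; the --include pattern below assumes file names matching the one shown in Quick Start:

# Download just the Q4_K_M file
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF \
    --include "*q4_k_m*" --local-dir ./models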

Technical Details:

The implementation includes:

  • Loop iteration wrapper (loop_num=2)
  • Global K/V caching from Loop 0
  • Dual attention (local + global) with gate mixing (sketched after this list)
  • Full backward compatibility with standard llama models
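
Concretely, the gate mixing blends each head's local attention with attention over the K/V cached from Loop 0. A simplified sketch; the exact gate parameterization (here a sigmoid over a learned per-head scalar) may differ in the actual code:

$$
\text{out}_h = g_h \cdot \mathrm{Attn}\!\left(Q_h, K_h^{\mathrm{loc}}, V_h^{\mathrm{loc}}\right) + (1 - g_h) \cdot \mathrm{Attn}\!\left(Q_h, K_h^{\mathrm{glob}}, V_h^{\mathrm{glob}}\right), \qquad g_h = \sigma(w_h)
$$

where $K^{\mathrm{glob}}, V^{\mathrm{glob}}$ are the keys/values cached during Loop 0 and $\mathrm{Attn}$ is ordinary scaled dot-product attention.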

PR to llama.cpp: https://github.com/ggml-org/llama.cpp/pull/18680

Performance:

Tested on IQuest-Coder-V1-40B-Loop-Instruct:

  • Prompt processing: ~3.4 t/s
  • Text generation: ~0.8 t/s
  • Memory overhead: ~512MB for global K/V cache
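
For anyone sizing this themselves, the global cache follows the usual K/V arithmetic. This is the generic formula, not a derivation of the ~512MB figure from this model's exact dimensions:

$$
\text{bytes} = 2 \times n_{\text{layers}} \times n_{\text{ctx}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per element}
$$

with the factor 2 counting the separate K and V tensors.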

Big thanks to the llama.cpp community and @ggerganov for the amazing ecosystem! 🙏


Related:

https://github.com/ggml-org/llama.cpp/pull/18680

Rejected as AI-generated slop violating their contributor guidelines.

Did you check the commit qpqpqpqpqpqp? Did you check the author (hint: not exactly a noob vibecoder)? Do you know Claude's capabilities? Sure, the main project's maintainers rejected the code under their contributor guidelines, but nobody there called it "slop", since the code actually runs the loop model. I just cloned the feature branch and I'm now building llama.cpp myself so I can test the loop model with Ollama; rough steps below.
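
Assuming the default CMake flow (the local branch name loop-attention is just my label):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Fetch PR #18680 into a local branch and switch to it
git fetch origin pull/18680/head:loop-attention
git checkout loop-attention
# Standard CMake release build
cmake -B build
cmake --build build --config Release -j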

People complained about compilers in the '60s because they would generate slop: code that might not do exactly what you would have written in assembly, plus inferences and optimizations that could lead to unintended behavior. Over the years compilers got better, and I'm sure we can all agree on that.

In general, we have the input language, the translator, and the output language. From highest to lowest level, we have:

Input Language -> Translator -> Output Language

English -> AI -> Code (C, C++)
Code -> Compiler -> Machine Code
Machine Code -> Hardware Circuits -> Electrical Signals

We jumped up a level of abstraction with AI. At the end of the day, we're working with language. Embrace it. There's no reason to hold AI in contempt. The next question is: what is the next level of abstraction above English?
