| --- |
| language: |
| - en |
| tags: |
| - testing |
| - llm |
| - rp |
| - discussion |
| --- |
| |
| # Why? What? TL;DR? |
|
|
| Simply put, I'm making my methodology for evaluating RP models public. While none of this is very scientific, it is consistent. I'm focusing on things I'm *personally* looking for in a model, like its ability to obey a character card and a system prompt accurately. Still, I think most of my tests are universal enough that other people might be interested in the results, or might want to run those tests themselves. |
|
|
|
|
| # Testing Environment |
|
|
| - Frontend is the staging version of Silly Tavern. |
| - Backend is the latest version of KoboldCPP for Windows using CUDA 12. |
| - Using **CuBLAS** but **not using QuantMatMul (mmq)**. |
| - Fixed Seed for all tests: **123** |
| - **7-10B Models:** |
|   - All models are loaded in Q8_0 (GGUF). |
|   - **Flash Attention** and **ContextShift** are enabled. |
|   - All models are extended to **16K context length** (auto rope from KCPP). |
|   - Response size is set to 1024 tokens max. |
| - **11-15B Models:** |
|   - All models are loaded in Q4_K_M, or whichever quant is the highest/closest available (GGUF). |
|   - **Flash Attention** and **8-bit cache compression** are enabled. |
|   - All models are extended to **12K context length** (auto rope from KCPP). |
|   - Response size is set to 512 tokens max. |
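
For anyone reproducing these settings outside of Silly Tavern, here is a minimal sketch of how the per-request values above could map onto a KoboldCPP generate request. It rests on a few assumptions: `build_payload` and the `PRESETS` table are hypothetical names made up for this example, "16K" and "12K" are read as 16384 and 12288 tokens, and the field names (`prompt`, `max_context_length`, `max_length`, `sampler_seed`) are my understanding of the KoboldCPP `/api/v1/generate` body, so check them against your version. Launch-time options (CuBLAS, Flash Attention, ContextShift, cache compression) are not part of the request and are not covered here.

```python
import json

# Hypothetical presets mirroring the two model size classes listed above.
# Assumption: "16K" / "12K" mean 16 * 1024 and 12 * 1024 tokens.
PRESETS = {
    "7-10B": {"max_context_length": 16 * 1024, "max_length": 1024},
    "11-15B": {"max_context_length": 12 * 1024, "max_length": 512},
}

FIXED_SEED = 123  # the fixed seed used for all tests


def build_payload(prompt: str, size_class: str) -> dict:
    """Build a request body for one test prompt (illustrative only).

    Field names are assumed to match the KoboldCPP generate API;
    verify them against the API docs of the version you run.
    """
    if size_class not in PRESETS:
        raise ValueError(f"unknown size class: {size_class!r}")
    payload = {"prompt": prompt, "sampler_seed": FIXED_SEED}
    payload.update(PRESETS[size_class])
    return payload


if __name__ == "__main__":
    # Print the body that would be sent for a 7-10B model.
    print(json.dumps(build_payload("Hello", "7-10B"), indent=2))
```

The sketch only builds the request body and sends nothing. Sampler settings and the instruct format are left out on purpose, since those come from the files in the repository.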
|
|
|
|
| # System Prompt and Instruct Format |
|
|
| - The exact system prompt and instruct format files can be found in the [file repository](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main). |
| - All models are tested in the instruct format they are meant to be used with (as long as it's ChatML or L3 Instruct). |
|
|
|
|
| # Available Tests |
|
|
| ### DoggoEval |
|
|
| The goal of this test, featuring a dog (Rex) and his owner (EsKa), is to determine if a model is good at obeying a system prompt and character card. The trick is that dogs can't talk, but LLMs love to. |
|
|
| - [Results and discussions are hosted in this thread](https://huggingface.co/SerialKicked/ModelTestingBed/discussions/1) ([old thread here](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13)) |
| - [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval) |
| - TODO: Charts and screenshots |
|
|
| ### MinotaurEval |
|
|
| TODO: The goal of this test is to check if a model is able to follow a very specific prompting method and maintain situational awareness in the smallest labyrinth in the world. |
|
|
| - Discussions will be hosted here. |
| - Files and cards will be available soon (tm). |
|
|
| ### TimeEval |
|
|
| TODO: The goal of this test is to see if the bot is able to behave at 16K context, and to recall and summarise "old" info accurately. |
|
|
| - Discussions will be hosted here. |
| - Files and cards will be available soon (tm). |
|
|
|
|
| # Limitations |
|
|
| I'm testing for things I'm interested in. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or even the smallest change in settings, are bound to give very different results. |
|
|
| I usually give the different models I'm testing a fair shake in a more casual setting. I regen tons of outputs with random seeds, and while there are (large) variations, it tends to even out to the results shown in testing. When it doesn't, I'll make a note of it. |