LLM Evaluation Keywords Explained
x-shot
The number of examples provided in the prompt before asking the model to answer. 0-shot means no examples — you’re testing the model’s raw knowledge. 1-shot gives one example. Few-shot provides several.
More examples generally improve performance, but 0-shot tests what the model actually knows versus what it can pattern-match from examples.
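As a rough sketch, here is how 0-shot and few-shot prompts are commonly assembled; the questions, demo answers, and prompt layout below are invented for illustration:

```python
# Hypothetical sketch: building 0-shot vs. few-shot prompts.
def build_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend len(examples) worked examples (the "shots") before the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

demo_examples = [
    ("What is 2 + 2?", "4"),        # made-up demonstration pairs
    ("What is 7 * 6?", "42"),
]

zero_shot = build_prompt("What is 13 * 17?", [])             # 0-shot: no examples
two_shot  = build_prompt("What is 13 * 17?", demo_examples)  # 2-shot: two examples
```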
CoT (Chain of Thought)
A prompting technique that encourages step-by-step reasoning before giving a final answer. Instead of jumping to a conclusion, the model works through the problem explicitly.
This often improves performance because the model generates more tokens of reasoning, and in a transformer each extra token is extra computation applied to the problem before the final answer is committed.
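As a sketch, the difference often comes down to how the prompt is phrased; the question below and the "Let's think step by step" cue are just one common pattern, not a fixed recipe:

```python
# Hypothetical sketch: the same question asked directly vs. with a chain-of-thought cue.
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

direct_prompt = f"Q: {question}\nA:"

cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step."  # nudges the model to write out its reasoning first
)
```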
Pass@k
An evaluation metric measuring the chance that at least one of k sampled attempts is correct. Pass@1 is the percentage of problems solved on the first try. Pass@10 asks whether any of 10 attempts produces a correct answer.
A low pass@1 combined with a much higher pass@10 suggests the model can solve the problem but isn't reliable about it.
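As a sketch, pass@k is often estimated from n sampled attempts per problem using the unbiased estimator popularized by the HumanEval/Codex evaluation; the sample counts below are invented for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples: any draw of k must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples drawn, 3 of them correct.
print(pass_at_k(n=20, c=3, k=1))   # 0.15  -> pass@1
print(pass_at_k(n=20, c=3, k=10))  # ~0.89 -> pass@10
```

The overall benchmark score is then the average of these per-problem estimates.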
Maj@k (Majority Voting)
Samples the model multiple times, takes the most common answer, and checks whether that majority answer is correct. Maj@32 runs the model 32 times and scores the answer that appears most often.
This assesses robustness: a model that gives the same (correct) answer consistently is more trustworthy than one that occasionally gets it right by chance.
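A minimal sketch of maj@k scoring, assuming answers can be compared by exact string match; the 32 sampled answers below are made up:

```python
from collections import Counter

def maj_at_k(answers: list[str], reference: str) -> bool:
    """Majority voting: take the most common of the sampled answers and compare it to the reference."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return majority_answer == reference

# Hypothetical run: 32 samples for one question, 19 of which agree on the correct answer.
samples = ["42"] * 19 + ["41"] * 8 + ["44"] * 5
print(maj_at_k(samples, reference="42"))  # True: the majority answer matches
```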
Putting it together
Think of it like a student taking an exam:
- Shots = how many practice problems they see before the test
- CoT = whether they show their work
- Pass@1 = did they get it right on the first try?
- Maj@32 = if they took the exam 32 times, what answer do they give most often?