LLM Evaluation Keywords Explained
x-shot
The number of examples provided in the prompt before asking the model to answer. 0-shot means no examples — you’re testing the model’s raw knowledge. 1-shot gives one example. Few-shot provides several.
More examples generally improve performance, but 0-shot tests what the model actually knows versus what it can pattern-match from examples.
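As a rough sketch, here is how 0-shot and few-shot prompts are commonly assembled; the questions, demo answers, and prompt layout below are invented for illustration:

```python
# Hypothetical sketch: building 0-shot vs. few-shot prompts.
def build_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend len(examples) worked examples (the "shots") before the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

demo_examples = [
    ("What is 2 + 2?", "4"),        # made-up demonstration pairs
    ("What is 7 * 6?", "42"),
]

zero_shot = build_prompt("What is 13 * 17?", [])             # 0-shot: no examples
two_shot  = build_prompt("What is 13 * 17?", demo_examples)  # 2-shot: two examples
```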
CoT (Chain of Thought)
A prompting technique that encourages step-by-step reasoning before giving a final answer. Instead of jumping to a conclusion, the model works through the problem explicitly.
This often improves performance because the model generates more tokens of reasoning, and in a transformer each extra token is extra computation applied to the problem before the final answer is committed.
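As a sketch, the difference often comes down to how the prompt is phrased; the question below and the "Let's think step by step" cue are just one common pattern, not a fixed recipe:

```python
# Hypothetical sketch: the same question asked directly vs. with a chain-of-thought cue.
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

direct_prompt = f"Q: {question}\nA:"

cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step."  # nudges the model to write out its reasoning first
)
```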
Pass@k
An evaluation metric measuring the chance that at least one of k sampled attempts is correct. Pass@1 is the percentage of problems solved on the first try. Pass@10 asks whether any of 10 attempts produces a correct answer.
A low pass@1 combined with a much higher pass@10 suggests the model can solve the problem but isn't reliable about it.
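As a sketch, pass@k is often estimated from n sampled attempts per problem using the unbiased estimator popularized by the HumanEval/Codex evaluation; the sample counts below are invented for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples: any draw of k must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples drawn, 3 of them correct.
print(pass_at_k(n=20, c=3, k=1))   # 0.15  -> pass@1
print(pass_at_k(n=20, c=3, k=10))  # ~0.89 -> pass@10
```

The overall benchmark score is then the average of these per-problem estimates.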
Maj@k (Majority Voting)
Samples the model multiple times, takes the most common answer, and checks whether that majority answer is correct. Maj@32 runs the model 32 times and scores the answer that appears most often.
This assesses robustness: a model that gives the same (correct) answer consistently is more trustworthy than one that occasionally gets it right by chance.
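A minimal sketch of maj@k scoring, assuming answers can be compared by exact string match; the 32 sampled answers below are made up:

```python
from collections import Counter

def maj_at_k(answers: list[str], reference: str) -> bool:
    """Majority voting: take the most common of the sampled answers and compare it to the reference."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return majority_answer == reference

# Hypothetical run: 32 samples for one question, 19 of which agree on the correct answer.
samples = ["42"] * 19 + ["41"] * 8 + ["44"] * 5
print(maj_at_k(samples, reference="42"))  # True: the majority answer matches
```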
Putting it together
Think of it like a student taking an exam:
- Shots = how many practice problems they see before the test
- CoT = whether they show their work
- Pass@1 = did they get it right on the first try?
- Maj@32 = if they took the exam 32 times, what answer do they give most often?