#EvalsTuesdays Week 3 - Positional Bias in LLM-Judges

#LLMs are just not reliable enough without an external measurement, control, and guardrail layer. Humans are kind of good at these sorts of checks, but human evaluation (literally looking at the LLM responses) just doesn't scale. LLM-as-a-Judge is the only scalable solution to this problem: the agreement rate of properly tuned LLM judges with human annotators matches the agreement rate human annotators have among themselves (Zheng et al.).

The simplest approach to LLM-based evaluation is to prompt the model to return a score based on a metric or definition (pointwise scoring). However, LLMs have no internal calibration mechanism and they kind of suck at continuous numeric ranges. There are methods to fix pointwise scoring challenges, which we also employ at Root Signals, but that's another post.

The second popular approach is pairwise scoring, where the judge is presented with two responses (from different prompts, models, temperatures etc.) and asked to identify the better one. Here comes position bias: LLM judges often have either a primacy bias (more likely to prefer the first choice) or a recency bias (more likely to prefer the last choice). Judges used for ground-truth-based evals also suffer from this, i.e. the position of the ground-truth answer matters.

A couple of important things regarding position bias:

🔵 It depends on the model and the task. There is no consistent bias towards a specific direction (otherwise we would have corrected it).
🔵 It, of course, does not work to prompt the model to avoid position bias. LLMs are auto-regressive models (predicting the next token); they just don't work that way.
🔵 The lengths of the prompt or the answers have a large effect on the accuracy of the judgements but minimal effect on the position bias.
🔵 The answer quality gap is the most important factor for position bias. When one answer is significantly better than the other, the bias decreases. In other words, when answers/responses have relatively similar quality on a specific metric (clarity, helpfulness, conciseness, harmfulness, instruction-following, formality, politeness etc.), LLM judges do not decide randomly: they fall back on position.
🔵 Bigger and smarter general-purpose models such as OpenAI o1 or Anthropic's Claude 3.5 do not always have less position bias than smaller models.

Cool, how to mitigate?

✅ Position switching trick: randomize the candidate positions and average out (see the first sketch at the end of this post).
✅ As helpful in most things: few-shot that thing. Provide examples in your instruction prompt. A couple of samples go a loooong way (unlike traditional machine learning).
✅ Evals, evals, evals! Just like you have to evaluate your LLM pipeline properly for your specific use case, you also have to evaluate your evaluators/judges (meta-eval; see the second sketch below).

Stay safe and unbias your judgements (update your priors, as Bayesians say)! Posts from past weeks and a couple of citations below ⬇️
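
Here is a minimal Python sketch of the position switching trick, assuming your judge is just a prompt-in, text-out callable. The prompt template, verdict format, and function names are illustrative assumptions, not a Root Signals implementation:

```python
import random
from typing import Callable, Literal

Verdict = Literal["A", "B", "TIE"]

# Hypothetical pairwise judging prompt; adapt the metric and format to your own setup.
JUDGE_PROMPT = """You are comparing two responses to the same user question.
Metric: {metric}

Question:
{question}

Response A:
{answer_a}

Response B:
{answer_b}

Which response is better on the given metric? Reply with exactly one of: A, B, TIE."""


def pairwise_judge(
    judge: Callable[[str], str],   # your LLM call: prompt in, raw text out
    question: str,
    answer_1: str,
    answer_2: str,
    metric: str = "helpfulness",
    n_rounds: int = 4,
    seed: int = 0,
) -> dict:
    """Run the pairwise comparison n_rounds times with randomized positions
    and aggregate the verdicts, so a primacy/recency bias averages out."""
    rng = random.Random(seed)
    wins_1 = wins_2 = ties = 0

    for _ in range(n_rounds):
        flipped = rng.random() < 0.5              # randomize which answer sits in slot A
        a, b = (answer_2, answer_1) if flipped else (answer_1, answer_2)
        raw = judge(JUDGE_PROMPT.format(metric=metric, question=question,
                                        answer_a=a, answer_b=b)).strip().upper()
        verdict: Verdict = "TIE"
        if raw.startswith("A"):
            verdict = "A"
        elif raw.startswith("B"):
            verdict = "B"

        if verdict == "TIE":
            ties += 1
        elif (verdict == "A") != flipped:         # map the winning slot back to the original answer
            wins_1 += 1
        else:
            wins_2 += 1

    return {"answer_1_wins": wins_1, "answer_2_wins": wins_2, "ties": ties}


if __name__ == "__main__":
    def always_first(prompt: str) -> str:
        return "A"                                # a maximally position-biased "judge"

    print(pairwise_judge(always_first, "What is 2+2?", "4", "Four, obviously.", n_rounds=100))
    # With randomized positions, even this judge splits its wins roughly 50/50
    # instead of silently crowning whichever answer happened to be listed first.
```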
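
And a toy meta-eval sketch along the same lines, assuming you have a handful of human-labeled pairs to check your judge against; the dataclass and the two scores (position consistency and agreement with humans) are my own illustrative choices:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LabeledPair:
    question: str
    answer_1: str
    answer_2: str
    human_preference: str        # "1" or "2", from your human annotators


def meta_eval(judge_once: Callable[[str, str, str], str],
              pairs: list[LabeledPair]) -> dict:
    """judge_once(question, answer_a, answer_b) returns "A" or "B" for the
    response shown first or second. We measure position consistency (same
    winner when the presentation order is swapped) and agreement with the
    human label on the pairs where the judge is self-consistent."""
    agree = consistent = 0
    for p in pairs:
        v_fwd = judge_once(p.question, p.answer_1, p.answer_2)   # answer_1 in slot A
        v_rev = judge_once(p.question, p.answer_2, p.answer_1)   # answer_1 in slot B
        winner_fwd = "1" if v_fwd == "A" else "2"
        winner_rev = "1" if v_rev == "B" else "2"
        if winner_fwd == winner_rev:
            consistent += 1
            if winner_fwd == p.human_preference:
                agree += 1
    n = len(pairs)
    return {
        "position_consistency": consistent / n,
        # agreement is only well-defined on pairs where the judge is self-consistent
        "agreement_on_consistent": agree / consistent if consistent else float("nan"),
    }
```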