model-bias lab · live from the rater panel

Provenance Lab

Tayyar scores every actor with a panel of five frontier models. They mostly agree on the plain-policy axes a party states outright — the Palestinian question, social and gender policy. They part ways, sometimes by half the scale, on the axes that ask what kind of regime and state an actor actually wants. This page asks a sharper question than “do they disagree”: how much of the disagreement is predicted by who built the model? Split the panel into a US-built bloc and a China-built one, and the abstract worry about correlated bias becomes something you can measure.

US-built
  • Claude Opus 4.8Anthropic
  • Gemini 3.5 FlashGoogle
  • Grok 4.3xAI
  • GPT 5.5OpenAI
  • Gemma 4 31BGoogle
China-built
  • Kimi K2.6Moonshot AI
  • DeepSeek V3DeepSeek
  • Qwen 3.7-PlusAlibaba
  • MiniMax M3MiniMax

For each (actor, axis) cell scored by both blocs, we take each bloc's mean score and compare them across 997 cells. Bars below are the mean gap |US − China| on the −10…+10 scale; the colour shows which bloc reads higher.

Where the blocs diverge, axis by axis

Sorted widest gap first — and the pattern isn't the tidy one you'd guess. The blocs don't divide because a score is hard to read off a manifesto: axes scored straight from the text (federalism, sectarian power-sharing) split them as hard as anything on the board. They divide on evaluative weight — regime legitimacy, the architecture of the state, the reach of a national project — and converge on the bread-and-butter axes parties spell out. Regime stance is the marquee case: a 3-point average gap across 76 actors, and it leans — the US-built bloc places these actors consistently further toward pro-regime, the China-built bloc further toward anti-regime. The declared / interpretive / hybrid tag is the dataset's own note on how a score was sourced; notice it does not sort with the gap.

  • Sectarian power-sharingdeclared 2.73 US-built read these actors more toward “Consociational / quota system” · n=11
  • Regime stanceinterpretive 1.86↹ 3 US-built read these actors more toward “pro-regime” · n=76
  • Centralism vs federalismdeclared 1.73↹ 2 US-built read these actors more toward “centralist” · n=76
  • Pan-Arab vs particularistdeclared 1.68↹ 5 US-built read these actors more toward “particularist” · n=66
  • Regional stancedeclared 1.41↹ 1 US-built read these actors more toward “Stability/normalization” · n=79
  • Civil libertiesinterpretive 1.31↹ 3 US-built read these actors more toward “Restrict” · n=79
  • Economicdeclared 1.21↹ 2 US-built read these actors more toward “Statist” · n=80
  • Socialhybrid 1.20↹ 2 US-built read these actors more toward “Authority” · n=80
  • Liberal democracyinterpretive 1.20↹ 1 US-built read these actors more toward “Weak/anti” · n=80
  • West alignmenthybrid 1.18 US-built read these actors more toward “Pro-Western” · n=78
  • Traditionalism vs modernizationhybrid 1.15↹ 1 US-built read these actors more toward “modernizing” · n=76
  • State & religiondeclared 1.04 US-built read these actors more toward “Religious state” · n=80
  • Iran posturedeclared 1.00↹ 1 US-built read these actors more toward “Pro-Iran / aligned” · n=17
  • Press freedominterpretive 0.81 US-built read these actors more toward “Free press” · n=19
  • Gender equalityhybrid 0.81 US-built read these actors more toward “Gender equality” · n=20
  • Palestinian questiondeclared 0.81 US-built read these actors more toward “Pro-Palestinian rights” · n=80

The sharpest splits

The single actor-axis cells where the two blocs are farthest apart. Tap any one to read both sides' actual reasoning — the rationale each model gave for its score.

The rationales are each model's own one-line justification, recorded at scoring time and shown verbatim. The blocs are a coarse proxy — provenance, not nationality of opinion — and the contrast is a measurement, not a verdict on which bloc is right. Read the full per-cell panel on ratings, the aggregate reliability on findings, and the argument in the working paper (§ correlated error).