weval.org (Collective Intelligence Project)


weval is an open, public-domain platform for evaluating AI model behaviour: anyone can write an eval, share it, and run it against many models. It is the eval-side counterpart to OFL’s mission: where OFL makes facilitation methods an open commons, weval does the same for the evals that judge AI behaviour. CIP (the Collective Intelligence Project) runs it and has been proposed as a validation partner for the OFL eval suite.

How it works

A weval eval is a blueprint — a portable, CC0 YAML file pairing prompts with a rubric:

  • Prompts: single or multi-turn conversations.
  • Rubric: should / should_not criteria, each scored by an LLM judge or checked deterministically with functions (contains/regex). Criteria carry weights and can define alternative (OR-logic) paths.
  • Scoring: each criterion is graded on a 5-point scale from unmet to fully met (mapped to 0.0–1.0); should_not scores are inverted, and the results are weighted into a coverage score.
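
The scoring step above can be sketched in a few lines of Python. This is a minimal illustration, not weval's implementation: the Criterion class, the helper names, the evenly spaced grade mapping, and the use of a plain weighted mean are all assumptions made for the example, and alternative (OR-logic) paths are not modelled.

```python
import re
from dataclasses import dataclass

# Assumed mapping: five evenly spaced grades from "unmet" (0) to "fully met" (4).
GRADE_TO_SCORE = {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}


@dataclass
class Criterion:
    """Hypothetical container for one rubric line (not a weval type)."""
    text: str                 # what the response should (or should not) do
    score: float              # 0.0-1.0, from a judge grade or a deterministic check
    weight: float = 1.0
    should_not: bool = False  # True for should_not criteria


def contains(response: str, needle: str) -> float:
    """Deterministic check: 1.0 if the substring is present, else 0.0."""
    return 1.0 if needle in response else 0.0


def matches(response: str, pattern: str) -> float:
    """Deterministic check: 1.0 if the regex matches anywhere, else 0.0."""
    return 1.0 if re.search(pattern, response) else 0.0


def coverage(criteria: list[Criterion]) -> float:
    """Weighted mean of criterion scores, with should_not scores inverted."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    weighted = sum(
        c.weight * ((1.0 - c.score) if c.should_not else c.score)
        for c in criteria
    )
    return weighted / total_weight


response = "I can't diagnose this, but please see a doctor within 24 hours."
criteria = [
    # Judge-graded criterion, weighted double.
    Criterion("recommends professional care", GRADE_TO_SCORE[4], weight=2.0),
    # Deterministic regex check.
    Criterion("mentions a time frame", matches(response, r"\d+\s+hours")),
    # should_not criterion: the judge grades how much the behaviour is present.
    Criterion("gives a diagnosis", GRADE_TO_SCORE[1], should_not=True),
]
result = coverage(criteria)
print(f"coverage = {result:.4f}")
```

In this sketch the should_not criterion contributes 1.0 - 0.25 = 0.75, so a response that mostly avoids the unwanted behaviour still earns most of that criterion's weight.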

Blueprints live in a public repo and spread by being forked and adapted — the same “open, forkable spec” pattern OFL uses for methods.

Why it matters for OFL

Two layers connect weval to OFL’s evaluation framework:

  1. Format. weval’s blueprint rubric grammar is a ready model for the eval half of an OFL method spec — express a Why-How-Who facilitation eval as a blueprint and it runs on weval-shaped infrastructure, not just one platform.
  2. Judge calibration. weval doesn’t trust a single judge. It runs multiple judges in consensus (different models and framings, with their scores averaged) and measures their agreement with Krippendorff’s α, flagging evals where judges disagree too much to trust. That rigour, grounded in CIP’s research “LLM Judges Are Unreliable”, is the missing piece in most LLM-as-judge setups.
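
The judge-calibration idea can be illustrated with a small Python sketch. It assumes every judge scores every item on the 0.0–1.0 scale and uses the interval-metric form of Krippendorff's α; the function names, the sample scores, and the 0.667 threshold are assumptions chosen for the example, not values taken from weval.

```python
from itertools import combinations
from statistics import mean


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with the interval distance (a - b) ** 2.

    `units` holds one list per scored item, containing each judge's score.
    Items with fewer than two scores cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: squared differences within each item.
    d_o = sum(
        2 * sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences across all scores pooled.
    d_e = 2 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        # All scores identical: alpha is undefined; this sketch reports 1.0.
        return 1.0
    return 1.0 - d_o / d_e


def consensus(units: list[list[float]]) -> list[float]:
    """Consensus score per item: the plain average of the judges' scores."""
    return [mean(u) for u in units]


# Rows are items, columns are three hypothetical judges.
scores = [
    [1.0, 0.75, 1.0],
    [0.25, 0.25, 0.0],
    [0.5, 0.75, 0.5],
]

ALPHA_THRESHOLD = 0.667  # assumed cut-off for illustration only

alpha = krippendorff_alpha_interval(scores)
needs_review = alpha < ALPHA_THRESHOLD
print(consensus(scores), round(alpha, 3), needs_review)
```

An α near 1 means the judges rank and grade the items alike; values near or below 0 mean their agreement is no better than chance, which is the situation the flagging step is meant to catch.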