llm-judge
StructAI is a toolkit for LLM interaction, offering structured outputs, context management, and parallel execution.
Inference-time scaling for LLMs-as-a-judge.
A boring, config-driven harness for evaluating AI systems. One YAML file drives the run, and the trace is the source of truth. Supports offline, backtesting, and online-eval modes, and works with any agent, RAG, or code-modifying system.
LLM evaluation framework: define what "correct", "well-formed", and "safe" mean before you measure.