Agent Evaluation Python Packages

trulens-core

Evaluation and Tracking for LLM Experiments and AI Agents

69K 3K 309

trulens-dashboard

Evaluation and Tracking for LLM Experiments and AI Agents

47K 3K 309

trulens-eval

Evaluation and Tracking for LLM Experiments and AI Agents

42K 3K 309

trulens-feedback

Evaluation and Tracking for LLM Experiments and AI Agents

40K 3K 309

trulens-otel-semconv

Evaluation and Tracking for LLM Experiments and AI Agents

39K 3K 309

trulens

Evaluation and Tracking for LLM Experiments and AI Agents

37K 3K 309

giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

32K 5K 479

trulens-connectors-snowflake

Evaluation and Tracking for LLM Experiments and AI Agents

17K 3K 309

trulens-providers-openai

Evaluation and Tracking for LLM Experiments and AI Agents

16K 3K 309

valanistack

AI agents fail like junior teammates, looping on bad ideas, ignoring feedback, and escalating commitment. vstack ports 34 of the most-cited organizational-behavior frameworks so you can diagnose your agents the same way you'd diagnose your team.

13K 1 0

trulens-apps-langchain

Evaluation and Tracking for LLM Experiments and AI Agents

12K 3K 309

trulens-providers-litellm

Evaluation and Tracking for LLM Experiments and AI Agents

11K 3K 309

trulens-providers-cortex

Evaluation and Tracking for LLM Experiments and AI Agents

10K 3K 309

multivon-eval

Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI/CD integration.

9K 8 0

trulens-apps-llamaindex

Evaluation and Tracking for LLM Experiments and AI Agents

9K 3K 309

any-agent

A single interface to use and evaluate different agent frameworks

7K 1K 94

agentsynth-ai

Outcome-verified agent trajectories, benchmarks, and RL environments — with a live leaderboard and a CI gate for your agents. Offline-first, MIT.

6K 0 2

trulens-providers-bedrock

Evaluation and Tracking for LLM Experiments and AI Agents

5K 3K 309

agentdog

AgentDog helps developers inspect, test, score, and monitor AI agent runs locally.

4K 5 0

trulens-providers-langchain

Evaluation and Tracking for LLM Experiments and AI Agents

4K 3K 309

agent-attest

Evidence-grounded evaluation for AI agents — verifies each claim against the agent's real tool outputs (constrained, evidence-grounded model judgment, not holistic LLM-judge guesswork), with confidence intervals.

4K 17 0

etzchaim

A diagnosable brain for your LLM. Cognitive architecture in the SOAR/ACT-R/CLARION/LIDA lineage, for the LLM era. Apache 2.0.

4K 1 0

evalview

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

4K 120 21

trulens-apps-langgraph

Evaluation and Tracking for LLM Experiments and AI Agents

3K 3K 309