LLM Evaluation Python Packages

mlflow-skinny

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

42.4M 27K 6K

mlflow

39.2M 27K 6K

mlflow-tracing

20.9M 27K 6K

deepeval

The LLM Evaluation Framework

7.6M 17K 2K

opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

3.8M 20K 2K

arize-phoenix

AI Observability & Evaluation

2.2M 10K 956

arize-phoenix-otel

AI Observability & Evaluation

1.8M 10K 956

arize-phoenix-client

AI Observability & Evaluation

977K 10K 956

arize-phoenix-evals

AI Observability & Evaluation

750K 10K 956

prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

450K 1K 118

judgeval

The Continuous-Improvement Stack for Agents. Our environment data and evals power agent improvement and monitoring.

147K 1K 93

garak

The LLM vulnerability scanner

72K 8K 1K

trulens-core

Evaluation and Tracking for LLM Experiments and AI Agents

69K 3K 309

trulens-dashboard

Evaluation and Tracking for LLM Experiments and AI Agents

47K 3K 309

trulens-eval

Evaluation and Tracking for LLM Experiments and AI Agents

42K 3K 309

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

41K 4K 559

trulens-feedback

Evaluation and Tracking for LLM Experiments and AI Agents

40K 3K 309

trulens-otel-semconv

Evaluation and Tracking for LLM Experiments and AI Agents

39K 3K 309

trulens

Evaluation and Tracking for LLM Experiments and AI Agents

37K 3K 309

giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

32K 5K 479

opik-optimizer

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

31K 20K 2K

lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

18K 4K 611

trulens-connectors-snowflake

Evaluation and Tracking for LLM Experiments and AI Agents

17K 3K 309

trulens-providers-openai

Evaluation and Tracking for LLM Experiments and AI Agents

16K 3K 309