PyRank

LLM Evaluation Python Packages

Python packages with the GitHub topic llm-evaluation. Sorted by relevance, with stars and monthly downloads.
mlflow
mlflow-skinny

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

38.2M 26K 6K
mlflow
mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

37.1M 26K 6K
mlflow
mlflow-tracing

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

17M 26K 6K
comet-ml
opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

6.2M 19K 1K
confident-ai
deepeval

The LLM Evaluation Framework

3.3M 15K 1K
Arize-ai
arize-phoenix

AI Observability & Evaluation

2.4M 10K 882
Arize-ai
arize-phoenix-otel

AI Observability & Evaluation

1.8M 10K 882
Arize-ai
arize-phoenix-client

AI Observability & Evaluation

952K 10K 882
Arize-ai
arize-phoenix-evals

AI Observability & Evaluation

765K 10K 882
JudgmentLabs
judgeval

The Continuous-Improvement Stack for Agents. Our environment data and evals power agent improvement and monitoring.

479K 1K 93
Microsoft
prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

449K 1K 114
truera
trulens-core

Evaluation and Tracking for LLM Experiments and AI Agents

147K 3K 280
truera
trulens

Evaluation and Tracking for LLM Experiments and AI Agents

123K 3K 280
truera
trulens-providers-litellm

Evaluation and Tracking for LLM Experiments and AI Agents

73K 3K 280
truera
trulens-dashboard

Evaluation and Tracking for LLM Experiments and AI Agents

62K 3K 280
truera
trulens-feedback

Evaluation and Tracking for LLM Experiments and AI Agents

61K 3K 280
truera
trulens-otel-semconv

Evaluation and Tracking for LLM Experiments and AI Agents

60K 3K 280
NVIDIA
garak

The LLM vulnerability scanner

59K 8K 949
truera
trulens-eval

Evaluation and Tracking for LLM Experiments and AI Agents

53K 3K 280
agenta-ai
agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

51K 4K 520
comet-ml
opik-optimizer

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

42K 19K 1K
Giskard-AI
giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

36K 5K 458
truera
trulens-connectors-snowflake

Evaluation and Tracking for LLM Experiments and AI Agents

35K 3K 280
truera
trulens-providers-cortex

Evaluation and Tracking for LLM Experiments and AI Agents

21K 3K 280
    • Data from PyPI, GitHub, ClickHouse, and BigQuery