llms-benchmarking
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications - Parea AI (YC S23)
Develop reliable AI apps
Evaluation suite for gender biases in LLMs.
Insert a Lie in a Haystack and evaluate the model's ability to detect it.
GenderBench - Evaluation suite for gender biases in LLMs