2024-04-14

LLM Evaluation, Training & Deployment

LLM Evaluation

lm-evaluation-harness - A framework for few-shot evaluation of language models.
MixEval - A reliable click-and-go evaluation suite compatible with both open-source and proprietary models, supporting MixEval and other benchmarks.
lighteval - A lightweight LLM evaluation suite that Hugging Face has been using internally.
OLMO-eval - A repository for evaluating open language models.
instruct-eval - This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
simple-evals - Eval tools by OpenAI.
Giskard - Testing & evaluation library for LLM applications, in particular RAGs
LangSmith - A unified platform from the LangChain framework for evaluation, human-in-the-loop (HITL) collaboration, logging, and monitoring of LLM applications.
Ragas - A framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines.

LLM Training Frameworks

DeepSpeed - A deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Megatron-DeepSpeed - DeepSpeed version of NVIDIA’s Megatron-LM that adds support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others.
torchtune - A Native-PyTorch Library for LLM Fine-tuning.
torchtitan - A native PyTorch Library for large model training.
Megatron-LM - Ongoing research training transformer models at scale.
Colossal-AI - Making large AI models cheaper, faster, and more accessible.
BMTrain - Efficient Training for Big Models.
Mesh TensorFlow - Model Parallelism Made Easier.
maxtext - A simple, performant and scalable Jax LLM!
Alpa - A system for training and serving large-scale neural networks.
GPT-NeoX - An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.

LLM Deployment

Reference: llm-inference-solutions

vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs.

TGI - A toolkit for deploying and serving Large Language Models (LLMs).

exllama - A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

llama.cpp - LLM inference in C/C++.

ollama - Get up and running with Llama 3, Mistral, Gemma, and other large language models.

Langfuse - Open Source LLM Engineering Platform 🪢 Tracing, Evaluations, Prompt Management, and Playground.

FastChat - A distributed multi-model LLM serving system with web UI and OpenAI-compatible RESTful APIs.

mistral.rs - Blazingly fast LLM inference.

MindSQL - A Python package for Text-to-SQL with self-hosting functionality and RESTful APIs compatible with proprietary as well as open-source LLMs.

SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution – all with a simple interface.

Haystack - An open-source NLP framework that allows you to use LLMs and transformer-based models from Hugging Face, OpenAI and Cohere to interact with your own data.

Sidekick - Data integration platform for LLMs.

QA-Pilot - An interactive chat project that leverages Ollama/OpenAI/MistralAI LLMs for rapid understanding and navigation of GitHub code repositories or compressed file resources.

Shell-Pilot - Interact with LLMs using Ollama models (or OpenAI, MistralAI) via pure shell scripts on your Linux (or macOS) system, enhancing intelligent system management without any dependencies.

LangChain - Building applications with LLMs through composability

Floom - AI gateway and marketplace for developers that enables streamlined integration of AI features into products

Swiss Army Llama - Comprehensive set of tools for working with local LLMs for various tasks.

LiteChain - Lightweight alternative to LangChain for composing LLMs

magentic - Seamlessly integrate LLMs as Python functions

wechat-chatgpt - Use ChatGPT on WeChat via Wechaty

promptfoo - Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.

Agenta - Easily build, version, evaluate and deploy your LLM-powered apps.

Serge - A chat interface crafted with llama.cpp for running Alpaca models. No API keys, entirely self-hosted!

Langroid - Harness LLMs with Multi-Agent Programming

Embedchain - Framework to create ChatGPT-like bots over your dataset.

CometLLM - A 100% open-source LLMOps platform to log, manage, and visualize your LLM prompts and chains. Track prompt templates, prompt variables, prompt duration, token usage, and other metadata. Score prompt outputs and visualize chat history, all within a single UI.

IntelliServer - Simplifies the evaluation of LLMs by providing a unified microservice to access and test multiple AI models.

OpenLLM - Fine-tune, serve, deploy, and monitor any open-source LLM in production. Used in production at BentoML for LLM-based applications.

DeepSpeed-MII - MII enables low-latency, high-throughput inference, similar to vLLM, powered by DeepSpeed.

Text-Embeddings-Inference - Inference for text embeddings in Rust (HFOIL license).

Infinity - Inference for text embeddings in Python

TensorRT-LLM - NVIDIA Framework for LLM Inference

FasterTransformer - NVIDIA Framework for LLM Inference (transitioned to TensorRT-LLM)

Flash-Attention - A method designed to enhance the efficiency of Transformer models

Langchain-Chatchat - Formerly langchain-ChatGLM, a local knowledge-based LLM (like ChatGLM) QA app built with LangChain.

Search with Lepton - Build your own conversational search engine using less than 500 lines of code by LeptonAI.

Robocorp - Create, deploy and operate Actions using Python anywhere to enhance your AI agents and assistants. Batteries included with an extensive set of libraries, helpers and logging.

LMDeploy - A high-throughput and low-latency inference and serving framework for LLMs and VLMs

Tune Studio - Playground for devs to finetune & deploy LLMs

LLocalSearch - Locally running web search using LLM chains

AI Gateway - Gateway streamlines requests to 100+ open & closed source models with a unified API. It is also production-ready with support for caching, fallbacks, retries, timeouts, load balancing, and can be edge-deployed for minimum latency.

talkd.ai dialog - Simple API for deploying any RAG or LLM you want, with support for adding plugins.

Wllama - WebAssembly binding for llama.cpp - Enabling in-browser LLM inference

GPUStack - An open-source GPU cluster manager for running LLMs