LoRAX
LoRAX is an open-source multi-LoRA inference server that scales to thousands of fine-tuned LLMs on a single GPU. It dynamically loads adapters and optimizes inference for high throughput and low latency.
About
LoRAX (LoRA eXchange) is an open-source framework designed to serve thousands of fine-tuned Large Language Models (LLMs) on a single GPU, which cuts serving costs without compromising throughput or latency. Key features include dynamic adapter loading from Hugging Face, Predibase, or local files; heterogeneous continuous batching, which packs requests for different adapters into the same batch; and adapter exchange scheduling to optimize system throughput. It also offers optimized inference with tensor parallelism, pre-compiled CUDA kernels, and quantization. LoRAX is production-ready, with prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, distributed tracing, and an OpenAI-compatible API that supports multi-turn chat conversations and structured output.
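To make per-request adapter selection concrete, here is a minimal Python sketch that builds (but does not send) a generation request naming a specific adapter. It assumes the server exposes a `/generate` REST endpoint that accepts an `inputs` string and a `parameters` object with an `adapter_id` field, as we understand the LoRAX README to describe; check the official documentation before relying on these names. The server address, the helper `build_generate_request`, and both adapter names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, adapter_id=None, max_new_tokens=64):
    """Build (but do not send) a POST request for a text-generation call.

    Assumption: the endpoint path and JSON field names below match the
    server's REST API; verify them against the official docs.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Assumed field name: selects the LoRA adapter applied to this request.
        # Omitting it would target the base model.
        parameters["adapter_id"] = adapter_id
    body = json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Two requests to the same server, each naming a different (hypothetical) adapter.
req_a = build_generate_request(
    "http://127.0.0.1:8080", "Summarize this ticket.", "my-org/summarizer-lora"
)
req_b = build_generate_request(
    "http://127.0.0.1:8080", "Translate this ticket.", "my-org/translator-lora"
)
```

With a server actually running at that address, `urllib.request.urlopen(req_a)` would send the request. The point of the sketch is that the adapter is chosen per request, so calls for different fine-tuned models can target one deployment.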
Pricing & Plans
Open Source
Free