Exllamav2

Visit Tool

ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs, offering fast performance and supporting various quantization formats. It provides dynamic batching and smart prompt caching for efficient generation.

Claim this tool

No Views Yet

At a glance

Pricing

Open Source

Free tier

Yes

API

Yes

Skill level

Technical

About

What is exllamav2?

ExLlamaV2 is a high-performance inference library designed to run large language models (LLMs) locally on modern consumer-grade GPUs. It supports both 4-bit GPTQ models and its own EXL2 format, which allows for mixed quantization levels from 2 to 8 bits per weight, optimizing for both performance and memory usage. Key features include dynamic batching, smart prompt caching, and K/V cache deduplication, all consolidated into a simplified API. The library is compatible with various frontends and APIs like TabbyAPI (OpenAI-compatible), ExUI, text-generation-webui, and lollms-webui, making it versatile for different deployment scenarios. Installation is flexible, supporting source builds, prebuilt wheels, and PyPI, catering to developers with different needs and environments.

Best used for

Ideal for developers and data scientists who need to run large language models efficiently on consumer GPUs, optimize model performance through advanced quantization, and integrate LLMs into custom applications. Especially valuable for those working with local inference and seeking high throughput.

Common actions

run LLMs locally

optimize LLM inference

quantize models

develop AI applications

face swapping"AI Agents"workflowsautomated workflowlow-code/no-codeopen-sourcegithub copilotdeepfakecollaboration

Capabilities

Key features

Fast LLM inference
EXL2 quantization support
Dynamic batching
Smart prompt caching
K/V cache deduplication
OpenAI-compatible API

Target Audience

developerdata scientist

Integrations

tabbyapiexuitext-generation-webuilollms-webui

Pricing & Plans

Open Source

Free

FAQs

What is the difference between ExLlamaV2 and ExLlamaV3?

ExLlamaV2 is currently archived, and development continues on ExLlamaV3. While ExLlamaV2 offers fast inference and advanced quantization, users seeking the latest features and ongoing support should refer to the ExLlamaV3 project for the most up-to-date developments and improvements.

What kind of quantization does ExLlamaV2 support?

ExLlamaV2 supports both the 4-bit GPTQ format and its own EXL2 format. The EXL2 format is highly flexible, allowing for mixed quantization levels from 2 to 8 bits per weight, and can apply multiple quantization levels to each linear layer for optimized performance and memory usage.

Can ExLlamaV2 be used with multiple GPUs?

Yes, ExLlamaV2 supports multi-GPU inference. When running inference, you can append the '--gpu_split auto' flag to enable the library to automatically manage and utilize multiple GPUs for enhanced performance and larger model support.

Trending

Subcategories trending in Coding & Development

Code Assistants DevOps & Infrastructure No-Code / Low-Code Testing & QA Backend & APIs Prompt Engineering

Trending

Also listed in

This tool also appears in

AI Agents & Automation › AI Frameworks & Infra

Explore

Browse AI tools by category

Content & Design Productivity & Business Coding & Development AI Agents & Automation Research & Education Wellness & Lifestyle Career Development Marketing & Growth Data & Analytics Customer Support & CX Finance E-commerce