Minbpe
Visit Toolminbpe provides minimal, clean code for the Byte Pair Encoding (BPE) algorithm, commonly used in LLM tokenization. It allows users to train, encode, and decode text with various BPE implementations.
At a glance
Trending
minbpe provides minimal, clean code for the Byte Pair Encoding (BPE) algorithm, commonly used in LLM tokenization. It allows users to train, encode, and decode text with various BPE implementations.
Trending
About
minbpe offers a minimal and clean code implementation of the Byte Pair Encoding (BPE) algorithm, a fundamental technique for LLM tokenization. The tool supports byte-level BPE, running on UTF-8 encoded strings, as popularized by the GPT-2 paper. It includes two primary tokenizers: BasicTokenizer for straightforward text processing and RegexTokenizer, which preprocesses text using regex patterns for more advanced tokenization, mirroring the approach used in GPT-2 and GPT-4. A GPT4Tokenizer wrapper is also provided for exact reproduction of GPT-4 tokenization. Users can train custom tokenizers, encode text to tokens, and decode tokens back to text, with options for handling special tokens. The repository emphasizes readability and hackability, making it an excellent resource for understanding and implementing BPE.
Capabilities
Pricing & Plans
Open Source
Free
FAQs
Trending