MNBVC
MNBVC is a Data & Analytics tool that provides a massive, never-ending Chinese corpus for training large language models. It includes diverse text data from mainstream and niche cultures and aims to reach 253TB.
About
MNBVC (Massive Never-ending BT Vast Chinese corpus) is an ambitious project to build an ultra-large-scale Chinese corpus for training large language models. Its target is 253TB of data, benchmarked against the 40TB cited for ChatGPT's training data. The dataset covers a wide range of plain-text Chinese data: news, essays, novels, books, magazines, papers, scripts, posts, wikis, ancient poetry, song lyrics, product descriptions, jokes, anecdotes, and chat logs. It aims to cover both mainstream and niche cultural content, including even "Martian language" (an internet slang style written with unconventional characters). The project also provides tools for processing, cleaning, and extracting data, such as charset detection, deduplication, format checking, and specialized cleaning scripts for individual data sources like WikiHow, diplomatic speeches, and legal documents. In addition, it offers code repository crawling tools and multimodal processing utilities for PDFs and arXiv documents.
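To make the charset detection and deduplication steps mentioned above more concrete, here is a minimal Python sketch. It is not MNBVC's own tooling: the function names `decode_text` and `dedupe`, the list of candidate encodings, and the whitespace normalization are all assumptions chosen for illustration. It guesses the encoding by trial decoding, then drops exact duplicates by hashing the normalized text.

```python
import hashlib
from typing import Iterable, List, Tuple

# Candidate encodings tried in order (an assumption for this sketch).
# Trial decoding is order-sensitive: gb18030 accepts most byte sequences,
# so stricter encodings such as utf-8 must come first.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def decode_text(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding decoded the input")


def dedupe(docs: Iterable[bytes]) -> List[str]:
    """Decode each document and keep only the first copy of each text.

    Documents are compared by the SHA-256 digest of their
    whitespace-normalized text, so this catches exact duplicates only,
    not near-duplicates.
    """
    seen = set()
    unique = []
    for raw in docs:
        text, _ = decode_text(raw)
        normalized = " ".join(text.split())  # collapse runs of whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(normalized)
    return unique


if __name__ == "__main__":
    samples = [
        "你好世界".encode("utf-8"),
        "你好世界".encode("gb18030"),  # same text, different encoding
        "再见".encode("utf-8"),
    ]
    for line in dedupe(samples):
        print(line)
```

Hashing the decoded, normalized text rather than the raw bytes lets the same document stored in two encodings count as one entry. A corpus at this scale would likely need statistical charset detection and near-duplicate detection as well, which this sketch does not attempt.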
Pricing & Plans
Open Source
Free