MNBVC

MNBVC is a Data & Analytics tool that provides a massive, never-ending Chinese corpus for training large language models. It includes diverse text data from mainstream and niche cultures, aiming to reach 253TB.

At a glance

Pricing: Open Source

Free tier: Yes

API: (no value listed)

Skill level: Technical

About

What is MNBVC?

MNBVC (Massive Never-ending BT Vast Chinese corpus) is a project to build an ultra-large-scale Chinese corpus for training large language models. It targets 253TB of data, a figure the listing compares with the 40TB reportedly used for ChatGPT. The dataset covers a wide range of plain-text Chinese data: news, essays, novels, books, magazines, papers, scripts, posts, wikis, ancient poetry, song lyrics, product descriptions, jokes, anecdotes, and chat logs. It aims to cover both mainstream and niche cultural content, including "Martian language" data.

The project also provides tools for processing, cleaning, and extracting data, such as charset detection, deduplication, format checking, and specialized cleaning scripts for sources like WikiHow, diplomatic speeches, and legal documents. In addition, it offers code repository crawlers and multimodal processing utilities for PDFs and arXiv documents.

Best used for

Ideal for developers and data scientists who need a large, diverse Chinese text corpus for training large language models, conducting natural language processing research, or preprocessing data for AI models. Especially valuable when the dataset must include both mainstream and niche cultural content.

Common actions

collect data

process data

train language models

research NLP

clean text

Capabilities

Key features

Ultra-large Chinese corpus
Diverse text data types
Data cleaning tools
Code repository crawlers
Multimodal processing tools
P2P data synchronization
Desensitized data

Target Audience

developer, data scientist

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What kind of data is included in the MNBVC corpus?

The MNBVC corpus includes a wide variety of plain-text Chinese data, such as news, essays, novels, books, magazines, papers, scripts, posts, wikis, ancient poetry, song lyrics, product descriptions, jokes, anecdotes, and chat logs. It covers both mainstream and niche cultural content.

How large is the MNBVC dataset and what is its target size?

At the time of this listing, the MNBVC dataset has a total volume of 60,732GB (about 60.7TB). The project's goal is to reach 253TB of data, which puts current progress at 24% of the target.
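
The 24% figure can be checked with a short calculation. The sketch below is illustrative only: the function name `progress_percent` and the `gb_per_tb` parameter are hypothetical, and it assumes decimal units (1TB = 1000GB), which is the convention that reproduces the listed percentage.

```python
def progress_percent(current_gb: float, target_tb: float, gb_per_tb: int = 1000) -> float:
    """Return progress toward the target as a percentage.

    gb_per_tb is an assumption: 1000 (decimal) matches the listed 24%;
    1024 (binary) would give a slightly lower figure.
    """
    return 100.0 * current_gb / (target_tb * gb_per_tb)


# Figures taken from the listing: 60732 GB collected, 253 TB target.
decimal_progress = progress_percent(60732, 253)
binary_progress = progress_percent(60732, 253, gb_per_tb=1024)

print(f"decimal units: {decimal_progress:.1f}%")
print(f"binary units:  {binary_progress:.1f}%")
```

With decimal units the result rounds to 24%, matching the listing; with binary units it rounds to 23%.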

Does MNBVC provide tools for data processing and cleaning?

Yes, MNBVC offers a suite of tools for processing and cleaning large-scale Chinese corpora. These include utilities for character set detection, deduplication, format checking, and specialized cleaning scripts for various data sources like WikiHow, diplomatic speeches, and legal documents.
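
To make two of these steps concrete, here is a minimal sketch of charset detection and exact deduplication using only the Python standard library. It is not MNBVC's own tooling: the function names, the candidate encoding list, and the normalization rules are all assumptions chosen for illustration. Real pipelines typically use statistical detectors and near-duplicate methods instead of this trial-decoding and exact-hash approach.

```python
import hashlib
import unicodedata

# Assumed candidate encodings for Chinese text, tried in order.
# Order matters: UTF-8 is strict, so it goes first; gb18030 accepts
# many byte sequences, so later candidates are rarely reached.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")


def detect_and_decode(raw: bytes):
    """Return (encoding, text) for the first candidate that decodes cleanly,
    or None if no candidate succeeds. This is a heuristic, not a proof."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def deduplicate(texts):
    """Keep the first occurrence of each text, comparing by the SHA-256
    hash of its normalized form (exact duplicates only)."""
    seen = set()
    unique = []
    for t in texts:
        digest = hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(t)
    return unique


if __name__ == "__main__":
    sample = "中文语料".encode("gb18030")
    print(detect_and_decode(sample))
    print(deduplicate(["你好  世界", "你好 世界", "再见"]))
```

Hashing the normalized text rather than the raw string means documents that differ only in spacing or full-width/half-width forms are treated as duplicates, which is one reasonable design choice among several.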
