What are the key differences between VoxCPM2, VoxCPM1.5, and VoxCPM-0.5B?
VoxCPM2 is the latest stable release, with 2B parameters, support for 30 languages, 48kHz audio, and advanced voice design and cloning. VoxCPM1.5 is a legacy 0.6B model that supports 2 languages and produces 44.1kHz audio. VoxCPM-0.5B is an older 0.5B model that also supports 2 languages but produces 16kHz audio. Of the three, VoxCPM2 has the broadest feature set and the highest audio quality.
Can VoxCPM2 be used for commercial projects?
Yes. VoxCPM2 is open-source and released under the Apache-2.0 license, which permits commercial use without licensing fees, so it can be integrated into commercial applications and services.
What are the system requirements for running VoxCPM?
To run VoxCPM, you need Python ≥ 3.10 and < 3.13, PyTorch ≥ 2.5.0, and CUDA ≥ 12.0. For real-time streaming, an NVIDIA RTX 4090 GPU is recommended, particularly for production deployments that use Nano-vLLM or vLLM-Omni.
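The version bounds above can be checked before installing anything else. The following is a minimal sketch, not part of VoxCPM itself: the helper names (parse_version, at_least, python_ok, torch_ok) are hypothetical, and the script assumes that the CUDA version PyTorch was built against (torch.version.cuda) is the relevant one to compare.

```python
import re
import sys

# Version bounds taken from the requirements stated above.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)
MIN_TORCH = (2, 5, 0)
MIN_CUDA = (12, 0)


def parse_version(text):
    """Turn a string such as '2.5.1+cu121' into a tuple of ints: (2, 5, 1)."""
    parts = []
    for piece in str(text).split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(found, required):
    """Compare version tuples, padding the found version with zeros."""
    padded = found + (0,) * (len(required) - len(found))
    return padded >= required


def python_ok(version=None):
    """Check a (major, minor) pair; defaults to the running interpreter."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


def torch_ok(torch_version, cuda_version):
    """cuda_version is None for CPU-only PyTorch builds, which fail the check."""
    if not at_least(parse_version(torch_version), MIN_TORCH):
        return False
    if cuda_version is None:
        return False
    return at_least(parse_version(cuda_version), MIN_CUDA)


if __name__ == "__main__":
    print("Python OK:", python_ok())
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.version.cuda is the CUDA version PyTorch was built with.
        print("PyTorch/CUDA OK:", torch_ok(torch.__version__, torch.version.cuda))
```

This only inspects version numbers. It does not verify that a GPU is present or that the installed driver supports the reported CUDA version.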
How does VoxCPM2 achieve voice design from a natural language description?
VoxCPM2 allows users to create a new voice by providing a natural-language description (e.g., "A young woman, gentle and sweet voice"). The model then synthesizes speech based on this description, eliminating the need for reference audio to generate a unique voice profile.
Does VoxCPM2 support real-time streaming for speech generation?
Yes. VoxCPM2 supports real-time streaming with a Real-Time Factor (RTF) as low as about 0.3 on an NVIDIA RTX 4090. Nano-vLLM or vLLM-Omni can lower the RTF further to about 0.13, which makes synthesis responsive enough for live applications.
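RTF is the time spent generating audio divided by the duration of the audio produced, so any value below 1 means synthesis runs faster than playback. The sketch below shows how the figure could be measured. The synthesize argument is a hypothetical stand-in for whatever function returns audio samples in your setup; it is not a VoxCPM API.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = generation time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize` (hypothetical: returns a sequence of
    mono samples) and return the resulting RTF."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # Worked example: 3 seconds of compute for 10 seconds of audio.
    print(real_time_factor(3.0, 10.0))
```

At an RTF of 0.3, 10 seconds of speech takes about 3 seconds to generate; at 0.13, it takes about 1.3 seconds.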