ClipBERT
ClipBERT is an open-source framework for end-to-end learning on image-text and video-text tasks. It uses sparse sampling for efficient multimodal learning and supports various downstream tasks.
About
ClipBERT is an efficient framework for end-to-end learning on image-text and video-text tasks, released as an official PyTorch implementation. The work received a CVPR 2021 Best Student Paper Honorable Mention. ClipBERT takes raw videos or images together with text as input and produces task predictions directly. It combines 2D CNNs with transformers and uses a sparse sampling strategy to keep multimodal learning efficient. The framework supports end-to-end pretraining and finetuning for the following tasks: image-text pretraining on COCO and VG captions; text-to-video retrieval on MSRVTT, DiDeMo, and ActivityNet Captions; video-QA on TGIF-QA and MSRVTT-QA; and image-QA on VQA 2.0. Its modular design makes it straightforward to add further image-text or video-text tasks.
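To give a rough feel for what sparse sampling means in practice, the sketch below picks a handful of short clips from a video instead of using every frame, then averages per-clip scores into one video-level prediction. This is an illustrative approximation only: the function names `sample_sparse_clips` and `aggregate_scores`, the one-clip-per-segment scheme, and the use of mean pooling are assumptions made for this example and are not taken from the ClipBERT codebase.

```python
import random


def sample_sparse_clips(num_frames, num_clips=2, frames_per_clip=2, seed=None):
    """Pick a few short clips (lists of frame indices) from a video.

    Hypothetical helper for illustration; not part of ClipBERT's API.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    rng = random.Random(seed)
    # Split the video into num_clips equal segments and draw one clip per segment.
    segment_len = num_frames / num_clips
    clips = []
    for i in range(num_clips):
        seg_start = int(i * segment_len)
        seg_end = max(seg_start, int((i + 1) * segment_len) - 1)
        # Latest start that keeps the clip inside its segment (clamped for short videos).
        last_start = max(seg_start, seg_end - frames_per_clip + 1)
        start = rng.randint(seg_start, last_start)
        # Clamp indices so very short videos never index past the last frame.
        clip = [min(start + k, num_frames - 1) for k in range(frames_per_clip)]
        clips.append(clip)
    return clips


def aggregate_scores(per_clip_scores):
    """Average per-clip score vectors into one video-level score vector.

    Mean pooling is an assumption here; other aggregation schemes are possible.
    """
    n = len(per_clip_scores)
    return [sum(column) / n for column in zip(*per_clip_scores)]


if __name__ == "__main__":
    # A 100-frame video reduced to 4 clips of 2 frames each (8 frames total).
    print(sample_sparse_clips(100, num_clips=4, frames_per_clip=2, seed=0))
    print(aggregate_scores([[1.0, 3.0], [3.0, 5.0]]))
```

In this sketch only `num_clips * frames_per_clip` frames would ever reach the visual encoder, which is where the efficiency gain comes from; a real pipeline would decode those frames and run them through the model rather than working with bare indices.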
Pricing & Plans
Open Source
Free