Shypd.ai

Data & Analytics

Browsing page 24 of AI tools for Data Cleaning & Prep in Data & Analytics. Sorted by confidence score — our independent quality rating.

Persistent Data

58%

Persistent Data is an AI tool designed for managing data persistence within artificial intelligence applications. It provides a solution for users to effectively store and retrieve data, which is crucial for the development and operation of AI models and research projects. This tool is particularly useful for developers and data scientists working on AI initiatives who require a reliable method for handling their application's data. The platform is accessible for free, making it an attractive option for individuals and teams looking for cost-effective data management solutions in the AI domain.

Pytesseract OCR

58%

Pytesseract OCR is a straightforward and efficient tool for optical character recognition, hosted as a Hugging Face Space. It allows users to upload an image and specify the language to accurately extract text content. The application then processes the image and returns the recognized text, making it ideal for digitizing printed or handwritten documents, processing images containing text, or automating data entry tasks. Its user-friendly interface simplifies the OCR process, making it accessible even for those without extensive technical knowledge. The tool is offered free of charge, providing a valuable resource for various text extraction needs.

AutoGL

58%

AutoGL is an open-source AutoML framework and toolkit specifically designed for machine learning on graphs. It enables researchers and developers to easily and quickly conduct automated machine learning tasks on graph datasets. The framework supports various graph-based machine learning tasks through its auto solver, which integrates five main modules: auto feature engineer, neural architecture search (NAS), auto model, hyperparameter optimization (HPO), and auto ensemble. AutoGL is compatible with popular graph libraries like PyTorch Geometric (PyG) and Deep Graph Library (DGL), supporting tasks such as node classification, link prediction, and graph classification. It also serves as a flexible framework for implementing and testing custom AutoML or graph-based machine learning models.

Versium

58%

Versium operates a comprehensive B2B2C identity resolution platform, enabling marketers to identify, understand, and reach prospects throughout the customer journey. Its proprietary data assets include over 2 billion contact points and 2+ trillion insight attributes, forming a rich identity graph for North American businesses and consumers. The flagship platform, Versium REACH, offers tools for data cleansing, audience building, and cross-channel activation. Additionally, the Versium API suite provides developers with real-time access to identity resolution and data enrichment capabilities. The company serves a diverse client base, from mid-market businesses to Fortune 500 enterprises, across various industries.

datumaro

58%

Datumaro is a comprehensive dataset management framework designed for computer vision tasks. It offers both a Python library and a command-line interface (CLI) tool, enabling users to efficiently build, analyze, and manage their datasets. Key functionalities include reading, writing, and converting datasets between numerous popular formats such as COCO, PASCAL VOC, YOLO, and ImageNet. The framework also facilitates advanced dataset manipulation, including filtering by custom criteria, splitting datasets into train/validation/test subsets, and performing annotation conversions. Furthermore, Datumaro provides tools for dataset quality checking, comparison with model inference, and generating detailed statistics, making it an invaluable resource for data scientists and machine learning engineers working with computer vision data.

hamilton

58%

Apache Hamilton is a lightweight Python library for building directed acyclic graphs (DAGs) of data transformations. It lets data scientists and engineers define testable, modular, and self-documenting dataflows that encode lineage, tracing, and metadata. The library is portable and runs anywhere Python does, including scripts, notebooks, Airflow pipelines, and FastAPI servers. Hamilton emphasizes separation of concerns, so data scientists can focus on problem-solving while engineers manage production pipelines. It supports data and schema validation, a built-in coding style, and a plugin-based architecture for custom integrations. The Apache Hamilton UI adds automatic visualization and monitoring of executions, covering data cataloging, dataset profiling, and execution tracking.
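
To make the DAG idea concrete, here is a minimal standard-library sketch of a dataflow in which each function is a node and its parameter names identify its upstream dependencies. This is an illustration of the concept only and does not use Hamilton's API: the `execute` helper, the node functions, and the input names are all hypothetical.

```python
import inspect

# Each function is one node; its parameter names are its upstream dependencies.
def spend_per_signup(spend: float, signups: int) -> float:
    return spend / signups

def spend_in_thousands(spend: float) -> float:
    return spend / 1000

def summary(spend_per_signup: float, spend_in_thousands: float) -> str:
    return f"{spend_in_thousands:.1f}k spent, {spend_per_signup:.2f} per signup"

def execute(target, nodes, inputs, cache=None):
    """Hypothetical resolver: compute `target` by first computing its dependencies."""
    cache = dict(inputs) if cache is None else cache
    if target not in cache:
        func = nodes[target]
        # Resolve every parameter by name, either from the inputs or from another node.
        kwargs = {
            name: execute(name, nodes, inputs, cache)
            for name in inspect.signature(func).parameters
        }
        cache[target] = func(**kwargs)
    return cache[target]

NODES = {f.__name__: f for f in (spend_per_signup, spend_in_thousands, summary)}
result = execute("summary", NODES, {"spend": 2500.0, "signups": 50})
print(result)
```

Because every node is a plain function, each one can be unit-tested in isolation, which is the property the description refers to as testable and modular.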

NVTabular

58%

NVTabular is a powerful feature engineering and preprocessing library specifically designed for tabular data, enabling the manipulation of terabyte-scale datasets. It accelerates computation on the GPU using the RAPIDS Dask-cuDF library, making it ideal for training deep learning-based recommender systems. As a core component of NVIDIA Merlin, it seamlessly integrates with other Merlin tools like Merlin Models, HugeCTR, and Merlin Systems to provide end-to-end acceleration for recommender systems on the GPU. NVTabular addresses challenges such as processing huge datasets, managing complex data pipelines, and overcoming input bottlenecks, allowing data scientists and ML engineers to focus on data transformation rather than scaling issues. It significantly reduces the time required for feature engineering and preprocessing, with reported completion times of 13 minutes on a single V100 GPU and 3 minutes on a DGX-1 cluster for the Criteo 1TB Click Logs Dataset.

snorkel

58%

Snorkel is an open-source system designed for the rapid generation of training data using weak supervision. Originating from Stanford in 2015, the project aimed to bring mathematical and systems structure to the often manual process of training data creation. It empowers users to programmatically label, build, and manage training data, addressing the critical role of data quality in machine learning project success. While the original Snorkel project is no longer actively developed, its core ideas and techniques have evolved into Snorkel Flow, an end-to-end AI application development platform. Snorkel is particularly useful for developers and data scientists looking to efficiently create large, labeled datasets for various machine learning tasks.
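
As a rough illustration of programmatic labeling, the sketch below combines a few heuristic labeling functions by majority vote. It does not use Snorkel's API, and plain majority vote is a simplified stand-in for the way Snorkel combines noisy labels. The label constants, the three heuristics, and the example messages are assumptions made up for this example.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: each encodes one noisy heuristic.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def majority_label(text):
    """Combine the non-abstaining votes; abstain when there are none or on a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

examples = [
    "Claim your prize at http://example.com today",
    "See you soon",
    "The quarterly report is attached for your review",
]
labels = [majority_label(t) for t in examples]
print(labels)
```

Examples that every heuristic abstains on stay unlabeled, so coverage grows by adding more labeling functions rather than by hand-labeling rows.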

ydata-synthetic

58%

ydata-synthetic is an open-source Python package designed for generating synthetic tabular and time-series data. It incorporates state-of-the-art generative models, including various GAN architectures like CTGAN, WGAN, and TimeGAN, as well as Gaussian Mixture models. The tool provides a low-code experience for quick data generation and features a Streamlit-based UI for an intuitive workflow, from training models to generating and profiling synthetic data samples. It supports diverse applications such as privacy compliance, bias removal, dataset balancing, and augmentation, making it a versatile solution for data scientists and developers working with sensitive or limited datasets.

🐍💨 Data Contamination Database

58%

The 🐍💨 Data Contamination Database is a Hugging Face Space designed to help users identify and manage data contamination within datasets and models. The application lets users filter and view contamination records: they can enter particular evaluation datasets and contaminated sources, then select options to exclude or analyze the affected entries. It is aimed at AI researchers and data scientists who need to check the integrity of their evaluation data before relying on model results.

RoadGauge Ltd

58%

RoadGauge Ltd offers an innovative solution for 3D road analysis, leveraging AI technology and readily available hardware like GoPro cameras. Users can mount a camera, record a drive, and upload the video to RoadGaugeAI for processing. The platform then reconstructs the road in 3D, providing sectional profiles with defects measured and geotagged to millimeter accuracy. It identifies safety hazards, profiles road surfaces, and helps locate, classify, and manage transport assets. This cost-effective system allows users to own their hardware, reduce inspection capital expenses, and receive survey results in various formats like PDF, KML, GPX, and CSV, with fast delivery times.

AutoRegex

58%

AutoRegex is an AI-powered tool designed to simplify the creation of regular expressions (RegEx). Users can input plain English descriptions of the patterns they need, and the tool will automatically generate the corresponding RegEx. This makes complex RegEx accessible to both experienced developers and non-developers who may not be familiar with the intricacies of regular expressions. It is particularly useful for tasks such as data parsing, validation, and pattern matching across various programming languages and data manipulation scenarios. By translating natural language into precise RegEx, AutoRegex aims to save time and reduce errors in development workflows.
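
For a sense of the kind of translation described, the sketch below pairs a plain English description with a matching regular expression and applies it with Python's `re` module. The pattern is hand-written for illustration and is not output captured from AutoRegex; the description, sample text, and variable names are assumptions.

```python
import re

# Description: "match dates written as YYYY-MM-DD"
# Hand-written pattern for illustration only. It checks the month range (01-12)
# and day range (01-31) but does not check days per month.
DATE_PATTERN = re.compile(r"\b(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")

text = "Shipped 2024-03-15, returned 2024-13-01, refunded 2024-04-02."
dates = [m.group(0) for m in DATE_PATTERN.finditer(text)]
print(dates)
```

Whatever tool produces a pattern, it is worth running it against a few known-good and known-bad inputs like this before using it for validation.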

DeepTables

58%

DeepTables (DT) is an easy-to-use, open-source toolkit designed to apply deep learning techniques to tabular data. It addresses the inefficiencies of traditional MLP models in learning distribution representation and the need for extensive manual feature engineering. DT incorporates the latest research findings from models like FM, DeepFM, Wide&Deep, DCN, PNN, and others, which have shown strong performance in areas like CTR prediction. The toolkit aims to provide an end-to-end solution for tabular data, focusing on ease of use for non-experts, out-of-the-box performance, and a flexible, expandable architecture. It requires TensorFlow and offers optional GPU support for enhanced performance.

Invisible Technologies Inc.

58%

Invisible Technologies offers AI software solutions for labs and enterprises, specializing in training and deploying AI models. The platform transforms data and manual processes into agent-ready workflows, and the company states that it has trained over 80% of the world's leading AI models. It provides modular systems that adapt models to specific business needs and integrates human expertise when necessary. Key offerings include AI training, back office automation, computer vision, contact center intelligence, and demand forecasting, catering to industries like asset management, banking, consumer, energy, healthcare, insurance, public sector, and sports.

Fatala Digital House

58%

Fatala Digital House specializes in digital transformation for Small and Medium-sized Enterprises (SMEs) and mid-sized companies, focusing on data and AI. They assist organizations in improving performance and optimizing costs by placing data at the core of their strategy. Services include data strategy consulting, custom solution development and deployment leveraging Web, Data Engineering, and Data Science expertise, and training to build capable teams. Fatala offers a data diagnostic to identify potential areas for improvement, aiming to increase revenue, achieve operational excellence through process automation and algorithmic implementation, and enhance business intelligence for better decision-making. Their teams are based in Africa and Europe.

YouData.ai

58%

YouData.ai provides a developer-first platform for AI data engineering, designed to connect and prepare enterprise data for AI applications. It ingests messy databases, automatically fixes schemas, and syncs data to Vector DBs with sub-50ms latency. The platform features self-healing pipelines that adjust to schema drifts upstream, ensuring no downtime. With over 200 integrations, it connects natively to various data sources like Postgres, Snowflake, and MongoDB. YouData.ai offers an SDK to manage infrastructure, rate limiting, and schema validation, eliminating the need for users to manage Kafka clusters or Airflow instances. It is SOC2 Type II and HIPAA compliant, with deployment options on-premise or in a managed VPC, and granular observability via Datadog and New Relic.

ATM

58%

ATM (Auto Tune Models) is an open-source AutoML system developed by the Data to AI Lab at MIT, focusing on simplifying machine learning model selection and tuning. It operates as a multi-tenant, multi-data system, allowing users to provide a classification problem and a dataset in CSV format. ATM then automatically searches for and builds the best predictive model. The system supports various data input methods, including local CSV files, AWS S3 buckets, and URLs. It offers a Python API for creating ATM instances, running model searches, and exploring results, including summaries of dataruns, best classifiers, and detailed scores. Users can export and load trained models for making predictions on new data. ATM is built on Python and is designed for both ease of use and scalability.

DatumFuse.AI

58%

DatumFuse.AI is an AI platform designed to automate the entire data preparation and storytelling process. It excels at data cleaning, harmonization, and augmentation, transforming raw, messy datasets into structured, enriched, and actionable information. Beyond preparation, DatumFuse.AI also offers robust data narration and visualization capabilities, enabling users to create clear, insightful stories from their data without requiring any coding expertise. This no-code approach makes complex data tasks accessible, providing clarity and efficiency for businesses looking to derive insights from their data quickly and effectively.

OpenHealth Technologies

58%

OpenHealth Technologies specializes in transforming fragmented lab data into valuable, harmonized insights for the health industry. The platform offers a central Lab Data API that standardizes and enriches lab results with medical context, making them suitable for AI analysis and integration. It provides modular white-label solutions for visualizing health data through customizable digital reports and health wallets, accessible on any device. The tool caters to a wide range of stakeholders including clinics, labs, health insurance providers, health technology companies, sports & wellbeing organizations, the supplementation industry, and pharmaceutical companies, enabling them to modernize operations, improve patient engagement, and accelerate research and development.

OrcaSheets

58%

OrcaSheets is an AI-first data analytics tool designed for instant processing of massive datasets directly on your local machine. It provides local-first data analytics, ensuring enhanced security and robust offline capabilities. Users can ask questions in plain English to get immediate answers from unlimited data, automating repetitive tasks and connecting to various databases. By eliminating the need for cloud infrastructure, OrcaSheets helps reduce cloud bills while keeping sensitive data secure. Its focus on local processing makes it a powerful solution for data scientists and analysts who require speed, security, and the ability to work without an internet connection.

OpenSheet

58%

OpenSheet is a modern spreadsheet and data exploration tool designed to simplify data analysis and transformation. It provides a user-friendly interface for working with diverse data formats including CSV, JSON, Excel, and Parquet. Users can easily analyze, visualize, and transform their data, making it an efficient solution for various data manipulation tasks. The platform aims to streamline the process of understanding and working with data, offering a comprehensive environment for data exploration and analytics.

OnDeck AI

58%

OnDeck AI offers a powerful vision model, Perception-0, designed to solve complex visual analysis tasks without the need for training or data labeling. It enables users to search and find anything across petabytes of unstructured footage, understand complex events, and generate human or machine-readable reports. The tool is built with a grounded vision architecture, enhancing reliability by using real data. OnDeck self-adapts to fit unique customer workflows and is deployable in the cloud, on-premise, or air-gapped environments. It is SOC 2 certified and used by defence agencies, universities, and robotics companies for applications like threat detection and accelerating robotics model training.

YData

58%

YData Fabric is a comprehensive platform designed to empower data scientists by improving data quality and accelerating AI model development. It offers robust features such as automated data profiling for quick exploratory data analysis, an interactive data catalog to track changes and drifts, and advanced synthetic data generation to protect sensitive information and augment datasets. The platform also provides scalable data preparation pipelines for cleaning, transforming, and orchestrating data flows, significantly reducing time-to-market for AI solutions. YData is trusted by a large community of data scientists and is recognized for its accuracy, scalability, and enterprise readiness in synthetic data.

tiktoken-go

58%

Tiktoken-go is a Go port of OpenAI's tiktoken library, designed for efficient Byte Pair Encoding (BPE) tokenization. This tool allows Go developers to seamlessly integrate tokenization capabilities into their applications, particularly when working with OpenAI's various language models like GPT-3.5, GPT-4, and embedding models. It features a cache mechanism, similar to the original Python library, which can be configured via the TIKTOKEN_CACHE_DIR environment variable to store token dictionaries and avoid repeated downloads. For scenarios requiring offline operation or custom dictionary loading, Tiktoken-go supports alternative BPE loaders, including an offline loader that uses embedded files. The library also provides utility functions for counting tokens in chat API calls, adapting to different model versions and their specific token calculation rules.
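
For readers unfamiliar with Byte Pair Encoding, the toy sketch below shows the core merge loop: start from single characters and repeatedly merge the adjacent pair with the lowest rank until no known pair remains. It is written in Python for brevity, works on characters rather than bytes, and uses a made-up four-entry rank table, so it is neither tiktoken-go's API nor a real OpenAI vocabulary. The names `bpe_encode` and `TOY_RANKS` are hypothetical.

```python
def bpe_encode(word, ranks):
    """Toy BPE: greedily merge the adjacent pair whose merged form has the lowest rank."""
    parts = list(word)
    while len(parts) > 1:
        best = None  # (rank, index) of the best mergeable pair found so far
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no adjacent pair appears in the rank table
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Made-up rank table: lower rank means the merge is applied earlier.
TOY_RANKS = {"lo": 0, "low": 1, "er": 2, "lower": 3}

tokens = bpe_encode("lowers", TOY_RANKS)
print(tokens, len(tokens))
```

Counting tokens for a piece of text then amounts to taking the length of the encoded sequence, which is the quantity the token-counting utilities mentioned above report.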