Data & Analytics
Browsing page 21 of AI tools for Data Pipelines & Integration in Data & Analytics. Sorted by confidence score — our independent quality rating.
sematic
Sematic is an open-source platform designed for ML engineers and data scientists to develop and manage machine learning pipelines. It enables users to write complex end-to-end pipelines using simple Python code, which can then be executed locally on a laptop, in a cloud VM, or on a Kubernetes cluster to leverage cloud resources. The platform emphasizes easy onboarding with no deployment or infrastructure needed to get started, offering local-to-cloud parity. Key features include end-to-end traceability of pipeline artifacts, reproducibility of results, dynamic graphs, lineage tracking, and runtime type-checking. Sematic also provides a modern web dashboard for monitoring, tracking, and visualizing pipelines and artifacts, along with integrations for Apache Spark, Ray, Snowflake, Plotly, Matplotlib, and Pandas.
Vizzio Technologies Pte Ltd
Vizzio Technologies specializes in creating ultra large-scale 3D reconstructions of digital city models globally, powered by algorithms and AI. Their proprietary "EARTH ENGINE" technology builds dimensionally-accurate, photorealistic, and semantic 3D digital twins of the planet using deep learning and satellite imagery, without the need for drones or LIDAR. This enables timely, global coverage for 3D mapping and visualization. Vizzio's AI can identify building types, extrapolate from imperfect outlines, and reconstruct entire cities. The platform supports cross-platform embedding and offers solutions for immersive virtual tours, real-time digital twins with live video feeds, and enhanced safety, security, and operations management for smart stations and facilities.
Atlan
Atlan is the context layer for enterprise AI, designed to bridge the AI context gap by providing a unified understanding of an organization's data, business logic, and institutional knowledge. It connects over 80 business systems into an Enterprise Data Graph, pulling context from warehouses, BI definitions, and applications. Atlan leverages AI agents to bootstrap the context layer by generating asset descriptions, linking business terms, and surfacing key business questions. Human experts then collaborate to resolve conflicts, annotate, and certify the context. This certified context is then activated across all AI agents and downstream tools via SQL, APIs, and the Atlan MCP server, ensuring AI acts on trusted and relevant information.
Heex
Heex is a Smart-Data platform designed for autonomous systems, including robots, AMRs, and autonomous vehicles. It addresses challenges like data overload, limited insights from raw data, the simulation-reality gap, and compliance issues. The platform's software solution uses an event-based approach with pre-set triggers to extract small, relevant data packages directly at the edge or in the cloud. This process transforms raw Big Data into "Smart-Data," which is then seamlessly distributed to relevant teams. Heex aims to improve productivity, optimize costs and resources, and foster better collaboration across organizations. It supports R&D for engineering teams through data collection at the edge and server-side data replay, and operations monitoring for operational teams with smart anomaly monitoring and incident notifications. Heex integrates with various robotic control systems and offers robust security features.
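As a rough illustration of the event-based approach described for Heex, the sketch below keeps only a small window of samples around each point where a pre-set trigger fires. It is not Heex's API: the function name `extract_event_packages`, the `before`/`after` parameters, and the speed readings are all hypothetical.

```python
def extract_event_packages(samples, trigger, before=2, after=2):
    """Return a small window of samples around every sample that fires the trigger.

    `samples` is a list of raw readings, `trigger` is a predicate on one reading.
    All names here are illustrative assumptions, not part of any real product API.
    """
    packages = []
    for i, sample in enumerate(samples):
        if trigger(sample):
            start = max(0, i - before)  # clamp at the start of the recording
            packages.append(samples[start:i + after + 1])
    return packages


# Hypothetical speed readings; the trigger fires on a sudden spike.
speeds = [10, 11, 12, 30, 12, 11, 10, 9]
packages = extract_event_packages(speeds, lambda s: s > 25, before=1, after=1)
print(packages)  # [[12, 30, 12]]
```

Only three of the eight readings leave the device in this example, which is the basic idea behind turning a raw stream into small, relevant packages.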
Orca Scan SQL Connectors
Orca Scan SQL Connectors provides a no-code barcode system designed for easy asset and inventory tracking using smartphones. It eliminates the need for complex APIs or scripts, allowing DBAs to push and pull data to and from SQL databases like MySQL, PostgreSQL, MariaDB, SQL Server, and Oracle DB. The platform offers features such as offline barcode scanning, cloud access for full visibility, a history log to track product lifecycles, and custom workflows with triggers for data collection. Users can also design and print industry-compliant barcode labels using pre-configured templates or custom designs. It integrates with various systems including Google Sheets, Microsoft Excel, Zapier, Power BI, and Tableau, making it a versatile solution for businesses looking to improve efficiency and data visibility.
Code for Good
Code for Good provides an AI-driven platform designed to integrate seamlessly with existing business systems like ERP and CRM, enabling companies to harness the full potential of their data. The platform focuses on making both structured and unstructured data actionable, driving efficiency and growth. It operates by building a smart layer over current IT infrastructure, ensuring that businesses can adopt AI without complex migrations. Key components include a Data & AI Platform, Data Collection App, Vision & Sensors technology, and advanced Models & Algorithms. Code for Good aims to deliver measurable return on investment by transforming innovation into direct operational profit and growth, making AI accessible and impactful for various industries.
Striim
Striim is a comprehensive platform for real-time data integration and streaming, designed to unify data across various sources including databases, applications, and cloud environments. It leverages Change Data Capture (CDC) to stream trillions of events in real-time, enabling businesses to build AI-ready data pipelines. The platform offers solutions for AI & ML data unification, high-throughput streaming integration, and high availability through real-time database replication. Striim supports over 100 connectors for popular sources like AWS, Google Cloud, Azure, Databricks, and Snowflake, and features capabilities such as streaming SQL, intelligent schema evolution, and pipeline monitoring. It is available as a fully managed SaaS (Striim Cloud) or a self-managed platform (Striim Platform), catering to diverse deployment needs.
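To make the Change Data Capture idea concrete, here is a minimal sketch of a consumer applying a stream of change events to an in-memory replica. The event shape (`op`, `key`, `row`) is an assumption chosen for illustration and does not reflect Striim's actual event format or API.

```python
def apply_change(replica, event):
    """Apply one change event to a dict-based replica keyed by primary key.

    The event fields ("op", "key", "row") are hypothetical, for illustration only.
    """
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)  # deleting a missing row is a no-op
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return replica


# A hypothetical ordered change stream captured from a source table.
events = [
    {"op": "insert", "key": 1, "row": {"name": "alice", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "alice", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)  # {1: {'name': 'alice', 'plan': 'pro'}}
```

Replaying the events in order is what keeps the replica consistent with the source; a real pipeline would also need ordering guarantees and checkpointing, which this sketch leaves out.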
Tray
Tray is an AI orchestration platform designed for business automation, agent building, and enterprise integration. It provides a unified platform to run governed AI agents, deploy secure MCP (Model Context Protocol) servers, and infuse intelligence into existing business processes. With over 700 pre-built connectors, Tray helps consolidate point-to-point integrations, reduce integration sprawl, and accelerate development. The platform offers centralized observability, security, and access control across agents, MCP, automation, and integration, making it suitable for global businesses looking to scale their AI initiatives and automate complex workflows.
lancedb
LanceDB is a developer-friendly, open-source embedded retrieval library designed for multimodal AI applications. Built on the Lance columnar format, it provides fast, scalable, and production-ready vector search capabilities, allowing users to store, index, and search petabytes of multimodal data and vectors. It supports comprehensive search including vector similarity, full-text, and SQL, along with advanced features like zero-copy and automatic versioning. LanceDB runs locally or in the cloud, offering complete data sovereignty. It integrates seamlessly with popular AI/ML frameworks like LangChain and LlamaIndex, and provides SDKs for Python, Node.js, and Rust, making it a central platform for building, training, and analyzing AI workloads.
maestro
Maestro is a general-purpose workflow orchestrator developed by Netflix, offering a fully managed workflow-as-a-service (WAAS) for data platform users. It serves a diverse user base, including data scientists, data engineers, machine learning engineers, software engineers, content producers, and business analysts, for various use cases. Maestro is designed to be highly scalable and extensible, supporting both existing and new use cases while offering enhanced usability. It schedules hundreds of thousands of workflows and millions of jobs every day, operating with strict Service Level Objectives (SLOs) even during traffic spikes. The tool provides a Python SDK client for creating and managing workflows programmatically, and supports integration with AWS and Kubernetes.
Stream.Estate
Stream.Estate offers a comprehensive real estate data API designed for PropTech companies, providing access to over 50 million deduplicated property listings from more than 1,500 sources across France, with more countries coming soon. The platform enables users to build AVMs and valuation tools, power lead generation widgets, track market trends with real-time webhooks, and analyze property portfolios at scale. Key features include a granular search API, market price evolution tracking, and real-time webhooks for instant notifications on new listings or price changes. Stream.Estate aims to help companies integrate real estate data quickly, reducing development time from months to days.
Lopus AI
Lopus AI unifies CRM, revenue, and customer data to provide comprehensive business analytics, enabling users to run ad-hoc analyses quickly. The platform connects data across sales, marketing, and product tools, offering a 'source of truth' for business context. Lopus learns business terminology upfront, ensuring consistent definitions for metrics like MRR and Churn across all queries, dashboards, and alerts. It features parallel research agents that trace root causes and surface hidden patterns, moving beyond traditional dashboards by allowing users to set alerts for critical insights. Every Lopus account includes a dedicated data engineer to handle edge cases and maintain data accuracy, and it supports over 500 integrations with common GTM, revenue, and product tools.
onefilellm
OneFileLLM is a command-line tool designed to simplify data aggregation for Large Language Models (LLMs). It automates the process of collecting information from diverse sources, including local files, GitHub repositories, web pages, PDFs, and YouTube transcripts. The tool then combines this multi-source data into a single, structured XML output, which is automatically copied to your clipboard. This structured format is optimized for LLM context, making it easier for models to process and understand complex information. OneFileLLM also features an alias system for creating simple and complex shortcuts to frequently used inputs, and advanced web crawling options for comprehensive documentation sites and academic sources.
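The aggregation step described for OneFileLLM can be pictured as wrapping each source in a tagged element inside one XML document. The sketch below uses only Python's standard library; the tag and attribute names (`documents`, `source`, `type`, `name`) are assumptions for illustration and are not OneFileLLM's actual output schema.

```python
import xml.etree.ElementTree as ET


def combine_sources(sources):
    """Combine (kind, name, text) tuples into a single XML string.

    Tag and attribute names are hypothetical, chosen only for this sketch.
    """
    root = ET.Element("documents")
    for kind, name, text in sources:
        node = ET.SubElement(root, "source", {"type": kind, "name": name})
        node.text = text  # ElementTree escapes special characters on output
    return ET.tostring(root, encoding="unicode")


# Hypothetical inputs standing in for a local file and a web page.
xml_output = combine_sources([
    ("local_file", "notes.txt", "Remember to benchmark <fast> paths."),
    ("web_page", "example-docs", "Installation steps and usage."),
])
print(xml_output)
```

Keeping each source in its own element lets a model, or a person reading the prompt, tell where one input ends and the next begins.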
Collibra
Collibra is a comprehensive data intelligence platform designed to unify governance for both data and AI, enabling organizations to achieve Data Confidence™ and scale AI initiatives from pilot to production. The platform offers a best-in-class catalog, flexible governance, continuous quality, and built-in privacy features. Key capabilities include AI Governance for cataloging, assessing, and monitoring AI use cases, Data Access for defining and enforcing data policies, Data Catalog for discovery, Data Governance for transparency, Data Lineage for visualizing data flow, Data Quality & Observability for monitoring, and Data Privacy for automated enforcement. Collibra also features Deasy Labs for transforming unstructured data into AI-ready assets, making it ideal for regulated organizations seeking trusted and valuable AI.
HashtagCashtag
HashtagCashtag is an open-source project that implements a big data processing pipeline based on a lambda architecture. It aggregates Twitter and US stock market data to perform user sentiment analysis and correlate it with stock price fluctuations. The pipeline utilizes Apache Kafka for data ingestion, Apache Spark and Spark Streaming for both batch and real-time processing, and Apache Cassandra for data storage. A Flask-based frontend, incorporating Bootstrap and HighCharts, provides visualization of trending stocks, historical data, and sentiment over time. This project demonstrates a comprehensive approach to real-time and batch data processing for financial market insights.
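In a lambda architecture, queries are typically answered by merging a precomputed batch view with a real-time view covering events that arrived since the last batch run. The following sketch shows that merge step with plain Python counters; the ticker counts and the `serve` function are hypothetical and are not taken from the HashtagCashtag codebase.

```python
from collections import Counter


def serve(batch_view, realtime_view):
    """Merge mention counts from the batch layer and the speed layer.

    Both views map a ticker symbol to a mention count. Names are illustrative.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts instead of replacing them
    return dict(merged)


# Hypothetical views: the batch layer covers history, the speed layer recent events.
batch_view = {"AAPL": 100, "TSLA": 40}
realtime_view = {"AAPL": 3, "MSFT": 1}

print(serve(batch_view, realtime_view))  # {'AAPL': 103, 'TSLA': 40, 'MSFT': 1}
```

When the next batch job finishes, the real-time view is reset and the batch view absorbs those events, so the merged result stays the same while the expensive computation moves offline.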
OpenMLDB
OpenMLDB is an open-source machine learning database that functions as a feature platform, providing consistent features for both training and inference in AI applications. It addresses common challenges in AI engineering such as data leakage, feature backfilling, and efficiency, which often consume significant time and effort. The platform prioritizes feature engineering using SQL, offering a unified programming language for defining and managing features. OpenMLDB includes a real-time SQL engine optimized for time series data, achieving ultra-low latency for real-time features, and a batch SQL engine based on a tailored Spark distribution. Its unified execution plan generator ensures consistency between batch and real-time SQL engines, enabling a "Development as Deployment" approach to significantly reduce costs from offline training to online inference.
NCompass Technologies
NCompass Technologies provides nCompass, an AI-powered performance profiling and debugging IDE extension for VSCode and Cursor. This tool is designed to help developers write highly performant code by offering advanced profiling capabilities. It goes beyond basic debugging to identify and resolve performance bottlenecks, ensuring that code is not only functionally correct but also optimized for speed and efficiency. By integrating directly into popular IDEs, nCompass streamlines the development workflow, allowing engineers to analyze and improve code performance within their familiar environment. The tool aims to empower developers to create robust and efficient applications with the assistance of AI.
TransmogrifAI
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an open-source AutoML library written in Scala, designed to run on Apache Spark. Developed by Salesforce, it focuses on enhancing machine learning developer productivity by automating various stages of the ML workflow, from feature engineering and validation to model selection. The library enforces compile-time type-safety, modularity, and reusability, enabling the creation of robust machine learning applications in a fraction of the time compared to traditional hand-tuned methods. It supports building models with minimal machine learning expertise, making advanced ML accessible to a broader range of developers. TransmogrifAI is particularly useful for structured data and offers flexibility for users who require more control over their ML pipelines.
DALI
The NVIDIA Data Loading Library (DALI) is a GPU-accelerated library designed to optimize data loading and pre-processing for deep learning applications. It offers a collection of highly optimized building blocks and an efficient execution engine, specifically tailored for processing image, video, and audio data. DALI addresses the common bottleneck of CPU-bound data pipelines by offloading these tasks to the GPU, significantly enhancing performance and scalability for training and inference. It supports various data formats and is portable across popular deep learning frameworks like TensorFlow, PyTorch, and PaddlePaddle. Key features include prefetching, parallel execution, batch processing, and extensibility for custom operators, making it a versatile solution for accelerating complex deep learning workflows.
Real-time-stock-market-prediction
Real-time-stock-market-prediction is an open-source project that offers a complete server-side architecture for real-time stock market prediction using Machine Learning. It leverages TensorFlow.js for building the ML model architecture and Kafka for efficient real-time data streaming and pipelining. The system integrates MongoDB for updating databases with incoming stock market logs, enabling analysis and model training, and storing model performance. Developed entirely with Node.js, this architecture supports parallel processing for real-time analysis, ML model training, and prediction, making it suitable for those interested in applying machine learning to financial market analysis and developing robust predictive models.
feathr
Feathr is a scalable, unified data and AI engineering platform widely used in production at LinkedIn and now an open-source project under the LF AI & Data Foundation. It allows users to define data and feature transformations using Pythonic APIs, register these transformations, and share them across teams. Particularly useful for AI modeling, Feathr automatically computes and joins feature transformations to training data with point-in-time correctness to prevent data leakage. It supports materializing and deploying features for online production use, offers native cloud integration with scalable architecture, and has been battle-tested for over six years. Feathr handles billions of rows and petabyte-scale data with built-in optimizations, providing rich transformation APIs including time-based aggregations and sliding window joins. It also features a built-in registry for feature reuse and an intuitive UI for searching and exploring features and their lineages.
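Point-in-time correctness means that each training row is joined only with feature values that were already known at that row's timestamp, so future information cannot leak into training. The sketch below illustrates the idea with the standard library's `bisect`; it does not use Feathr's API, and the function name and data layout are assumptions made for this example.

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_history):
    """Attach to each label the latest feature value observed at or before its timestamp.

    labels: list of (entity, timestamp, label) tuples.
    feature_history: dict mapping entity -> list of (timestamp, value), sorted by timestamp.
    Returns a list of (entity, timestamp, feature_value_or_None, label) tuples.
    """
    joined = []
    for entity, ts, label in labels:
        history = feature_history.get(entity, [])
        timestamps = [t for t, _ in history]
        idx = bisect_right(timestamps, ts)  # number of observations with t <= ts
        value = history[idx - 1][1] if idx > 0 else None
        joined.append((entity, ts, value, label))
    return joined


# Hypothetical feature observations and labelled events.
feature_history = {"u1": [(1, 0.1), (5, 0.5), (9, 0.9)]}
labels = [("u1", 6, 1), ("u1", 0, 0), ("u2", 3, 1)]

print(point_in_time_join(labels, feature_history))
# [('u1', 6, 0.5, 1), ('u1', 0, None, 0), ('u2', 3, None, 1)]
```

The label at timestamp 6 receives the value observed at timestamp 5, not the later value from timestamp 9. A naive join on the latest value would have used 0.9 and leaked future data.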
mldb
MLDB is an open-source SQL database specifically engineered for machine learning applications. Developed by MLDB.ai, it allows users to install it as a command-line tool, run scripts, or interact via a RESTful API. Key functionalities include storing data, exploring it using a specialized SQL dialect, training machine learning models, and deploying these models as APIs. The database is designed for high efficiency in data loading, classical ML algorithm training, and prediction endpoints. It features a data model and type system optimized for ML, supporting nested structures, embeddings, and tensors. MLDB is extensible through C++, Python, and JavaScript plugins, and is currently being rearchitected for a smaller core and broader deployment platforms, aiming to simplify the creation and deployment of ML solutions.
robustmq
RobustMQ is a unified messaging engine built with Rust, designed as a communication infrastructure for the AI era. It operates as a single binary, one broker, and one storage layer, eliminating external dependencies and allowing deployment from edge devices to cloud clusters. It natively supports MQTT, Kafka, NATS, AMQP, and its own mq9 protocol on a shared storage layer, meaning a message written once can be consumed by any protocol. The mq9 protocol is specifically designed for AI Agent asynchronous communication, offering features like agent mailboxes with persistent store-first delivery, priority levels, and public mailbox discovery. RobustMQ emphasizes minimal operations, multi-tenancy, and ultra-low-latency dispatch, making it suitable for diverse messaging needs from IoT to streaming data pipelines.
Euno
Euno is an AI context platform designed for enterprise data, transforming metadata into automated and trusted context for AI agents. It enables AI agents to act reliably and safely at enterprise scale by providing them with everything they need to know about core data. Euno connects to AI agents to ensure they query the right data using the correct logic, even in complex environments. Key features include real-time context graph construction with lineage, activity, health, and business logic, built-in governance automation, and automated labeling of assets based on custom rules. It helps organizations avoid common AI failures like hallucinations and inconsistencies by grounding AI decisions in accurate, contextualized data.