Data & Analytics
Browsing page 13 of AI tools in the Statistical & Scientific category of Data & Analytics. Sorted by confidence score, our independent quality rating.
Recursion
Recursion is a clinical-stage TechBio company that uses AI to decode biology, with the stated aim of radically improving lives. Founded over a decade ago, Recursion built its proprietary Recursion OS, an AI-native, end-to-end drug discovery and development platform. The platform integrates biology, chemistry, and clinical development into a unified intelligence system, powered by multimodal data, purpose-built AI models, and bilingual teams. Recursion aims to reduce the roughly 90% failure rate of traditional drug discovery by using AI to understand the cellular disruptions that drive disease. The company has produced an advanced pipeline of potential first-in-class and best-in-class treatments for conditions with high unmet need, including aggressive cancers and rare diseases, and reports improvements in speed, efficiency, and cost from hit identification to IND-enabling studies.
bayesflow
BayesFlow is a Python library for efficient Bayesian inference using deep learning. Its API streamlines amortized Bayesian workflows for parameter estimation, model comparison, and validation. The library provides a collection of generative models, including diffusion and consistency models, and supports multiple backends via Keras 3, so users can run it on PyTorch, TensorFlow, or JAX. It suits both complex simulators and traditional statistical models, especially when conventional methods are inefficient or unavailable.
Best Bike Split
Best Bike Split is a sophisticated physics-based cycling and triathlon race planning software designed for athletes and coaches. It leverages power data, real course files, weather conditions, and aerodynamic modeling to generate precise, variable power pacing plans for any event. Users can predict their bike split times, get specific power targets for goal results, and train effectively for course demands. The platform also allows for analysis of how factors like fitness, drag, and weather impact performance, and offers tools for comparing planned versus actual race data. Trusted by over 150,000 athletes and professional WorldTour teams, Best Bike Split aims to help riders achieve their fastest bike split with confidence.
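As a rough illustration of what "physics-based" pacing can mean, the sketch below evaluates the standard steady-state cycling power equation (aerodynamic drag, rolling resistance, and gravity). It is not Best Bike Split's model: the function name, every parameter, and all default values are assumptions chosen for the example.

```python
import math

def required_power(speed_ms, mass_kg, cda_m2, crr, grade=0.0,
                   air_density=1.225, headwind_ms=0.0, drivetrain_eff=0.97):
    """Estimate the rider power (watts) needed to hold a steady speed.

    All inputs are hypothetical example parameters:
    cda_m2 is drag area, crr is the rolling resistance coefficient,
    grade is rise over run (0.05 means a 5% climb).
    """
    g = 9.81  # gravitational acceleration, m/s^2
    theta = math.atan(grade)
    air_speed = speed_ms + headwind_ms
    # Drag acts against the direction of the relative air flow.
    f_aero = 0.5 * air_density * cda_m2 * air_speed * abs(air_speed)
    f_roll = crr * mass_kg * g * math.cos(theta)
    f_grav = mass_kg * g * math.sin(theta)
    # Power at the wheel is force times ground speed; divide by
    # drivetrain efficiency to get power at the pedals.
    return (f_aero + f_roll + f_grav) * speed_ms / drivetrain_eff

flat = required_power(10.0, mass_kg=80.0, cda_m2=0.30, crr=0.004)
climb = required_power(10.0, mass_kg=80.0, cda_m2=0.30, crr=0.004, grade=0.03)
print(f"flat: {flat:.0f} W, 3% climb: {climb:.0f} W")
```

Because drag grows with the cube of speed while gravity grows linearly, a model of this shape tends to favor spending more power on climbs and less on fast descents, which is the intuition behind variable power pacing.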
pytorch_tabular
PyTorch Tabular offers a unified and accessible framework for applying deep learning models to tabular data. Designed around low-resistance usability, easy customization, and scalability, it simplifies the development and deployment of advanced models. The library integrates with PyTorch and PyTorch Lightning, enabling training on both GPUs and CPUs, alongside automatic logging for experiment tracking. It supports a variety of state-of-the-art models, including FeedForward Networks, NODE, TabNet, Mixture Density Networks, AutoInt, TabTransformer, GATE, GANDALF, and DANETs, as well as semi-supervised Denoising AutoEncoders. Users can also implement custom models, making it suitable for both real-world applications and research.
Resemblyzer
Resemblyzer is a Python package designed for advanced voice analysis and comparison, leveraging deep learning techniques. It functions by deriving a high-level representation of a voice through a sophisticated voice encoder model. The tool generates a summary vector consisting of 256 values, which effectively encapsulates the unique characteristics of a spoken voice. This capability makes it suitable for applications requiring detailed voice identification, verification, or similarity analysis, providing a robust framework for understanding vocal nuances in various contexts.
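One common way to compare two fixed-length voice vectors is cosine similarity. The sketch below shows that comparison on 256-value vectors. It does not call Resemblyzer: the vectors are random placeholders standing in for real voice summaries, and the choice of cosine similarity is an assumption made for illustration.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def placeholder_embedding(seed, dim=256):
    """Hypothetical stand-in for a real 256-value voice summary vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

voice_a = placeholder_embedding(seed=1)
voice_b = placeholder_embedding(seed=2)

print("same vector:      ", round(cosine_similarity(voice_a, voice_a), 3))
print("different vectors:", round(cosine_similarity(voice_a, voice_b), 3))
```

In a verification setting, a score like this would typically be compared against a threshold tuned on known same-speaker and different-speaker pairs.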
SuperGluePretrainedNetwork
SuperGluePretrainedNetwork is a research project from Magic Leap, presented at CVPR 2020, focusing on learning feature matching using Graph Neural Networks. The core of the project is the SuperGlue network, which integrates a Graph Neural Network with an Optimal Matching layer. This architecture is specifically designed to perform matching tasks on two distinct sets of sparse image features. The repository offers both the PyTorch code implementation and pretrained weights, making it accessible for researchers and developers interested in computer vision and feature matching applications. It serves as a valuable resource for those looking to implement or build upon advanced feature matching techniques.
stellargraph
StellarGraph is a comprehensive Python library designed for machine learning on various types of graphs and networks. It provides a rich collection of state-of-the-art algorithms, including GraphSAGE, GCN, GAT, Node2Vec, and Metapath2Vec, enabling users to perform tasks such as representation learning for nodes and edges, classification of nodes or entire graphs, and link prediction. The library supports diverse graph structures, from homogeneous to heterogeneous and knowledge graphs, and integrates seamlessly with TensorFlow 2, Keras, Pandas, and NumPy. This makes it user-friendly, modular, and extensible, allowing for smooth interoperability with existing machine learning workflows and easy augmentation of its core algorithms.
UCR_Time_Series_Classification_Deep_Learning_Baseline
UCR_Time_Series_Classification_Deep_Learning_Baseline is an open-source repository designed to provide a foundational deep learning model for time series classification. It specifically utilizes fully convolutional neural networks (FCNs) to establish a robust baseline for research and application. The tool is tailored for univariate time series data, making it suitable for a wide array of domains including finance, industrial applications, and healthcare, where time-dependent data analysis is crucial. It supports both representation learning and classification tasks, offering a valuable resource for data scientists and researchers looking to explore or implement deep learning solutions for time series analysis.
ClipBERT
ClipBERT is the official PyTorch implementation of an efficient framework for end-to-end learning on image-text and video-text tasks. Recognized with a CVPR 2021 Best Student Paper Honorable Mention, ClipBERT takes raw videos or images together with text as input and generates task predictions. It combines 2D CNNs and transformers with a sparse sampling strategy to make multimodal learning efficient. The framework supports end-to-end pretraining and finetuning for tasks such as image-text pretraining on COCO and VG captions, text-to-video retrieval on MSRVTT, DiDeMo, and ActivityNet Captions, video-QA on TGIF-QA and MSRVTT-QA, and image-QA on VQA 2.0. Its modular design allows additional image-text or video-text tasks to be integrated easily.
Leash Bio
Leash Bio is revolutionizing drug design by building a massive, proprietary dataset of protein-molecule interactions. The platform screens millions of compounds against thousands of proteins, generating over 30 billion data points. This extensive dataset is ideal for training advanced machine learning models, enabling faster and more effective drug discovery. Leash Bio employs a dynamic, cyclical engine that continuously harnesses data, iterates machine learning, and refines its approach, with each cycle taking only a few months. Their innovative software designs and refines novel chemical matter, leading to molecules with desired activities. The company is developing internal oncology programs and partnering with biopharma companies to explore new molecule opportunities.
Troople
Troople is a consultancy service that empowers businesses to embrace AI for enhanced efficiency and well-informed decision-making. They bridge the knowledge gap by leveraging data science potential through advanced AI technologies. Troople offers a dynamic process from strategy to implementation, including AI integration, reporting, data strategy formulation, and proof-of-concept implementation. Their services focus on data marketing, intelligent automation, and visualization to decode complex information and provide actionable insights. They also offer ongoing support, maintenance, AI governance, and ethical guidelines for responsible AI usage, ensuring businesses can effectively adopt and scale AI solutions.
jiwer
JiWER is a simple and fast Python package designed for evaluating automatic speech recognition (ASR) systems. It supports several key similarity measures, including word error rate (WER), match error rate (MER), word information lost (WIL), word information preserved (WIP), and character error rate (CER). These measures are computed efficiently using the minimum-edit distance algorithm, powered by the high-performance RapidFuzz library which leverages C++ for speed. The package also defines specific behaviors for empty reference and hypothesis pairs, addressing potential division-by-zero issues and allowing for testing models on silent audio. JiWER is released under the Apache License, Version 2.0, making it a robust and accessible tool for developers working with speech-to-text technologies.
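To make the word error rate concrete, here is a minimal pure-Python sketch that computes WER as the word-level minimum edit distance divided by the number of reference words. It is not JiWER's API or implementation (JiWER relies on RapidFuzz); the function name is hypothetical, and rejecting an empty reference is a simplification of the package's own handling of empty pairs.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Simplification for this sketch: avoid division by zero by refusing
        # an empty reference instead of defining a special-case value.
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat down"))
```

Here the hypothesis contains one inserted word against a three-word reference, so the printed value is one third. Note that WER can exceed 1.0 when the hypothesis has many insertions.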
tennis_analysis
tennis_analysis is an open-source project that analyzes tennis players and ball movement in video footage. It uses YOLOv8 for player detection and a fine-tuned YOLO model for tennis ball detection. It also uses Convolutional Neural Networks (CNNs) to extract court keypoints, which anchors the detections to positions on the court. From these, it measures player speed, ball shot speed, and the total number of shots, offering useful inputs for performance analysis. The project is well suited to people who want to build machine learning and computer vision skills through a practical, hands-on application.
HashtagCashtag
HashtagCashtag is an open-source project that implements a big data processing pipeline based on a lambda architecture. It aggregates Twitter and US stock market data to perform user sentiment analysis and correlate it with stock price fluctuations. The pipeline utilizes Apache Kafka for data ingestion, Apache Spark and Spark Streaming for both batch and real-time processing, and Apache Cassandra for data storage. A Flask-based frontend, incorporating Bootstrap and HighCharts, provides visualization of trending stocks, historical data, and sentiment over time. This project demonstrates a comprehensive approach to real-time and batch data processing for financial market insights.
PyTorch-BayesianCNN
PyTorch-BayesianCNN provides an implementation of Bayesian Convolutional Neural Networks (CNNs) with variational inference, specifically utilizing Bayes by Backprop, within the PyTorch framework. This tool allows researchers and developers to build CNNs that can infer intractable posterior probability distributions over weights, offering a significant advantage over traditional frequentist approaches by providing uncertainty estimations. It includes two types of Bayesian layer implementations: BBB (Bayes by Backprop) and BBB_LRT (Bayes by Backprop with Local Reparametrization Trick), which enhances sampling efficiency. The repository supports standard datasets like MNIST, CIFAR10, and CIFAR100, and includes implementations of common models such as AlexNet and LeNet, making it a valuable resource for experimenting with Bayesian deep learning and understanding model uncertainty.
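The core idea behind Bayes by Backprop can be sketched without any framework: each weight is described by a mean and a spread, a concrete weight is sampled as mean plus spread times standard normal noise, and repeated forward passes yield a distribution of outputs. The sketch below uses a single scalar weight and pure Python. It is not the repository's code, and the names `mu`, `rho`, and the softplus parameterization of the standard deviation are assumptions for the example.

```python
import math
import random

def softplus(x):
    """Maps any real number to a positive value; used here for the std dev."""
    return math.log1p(math.exp(x))

def sample_weight(mu, rho, rng):
    """Draw w = mu + sigma * eps with eps ~ N(0, 1) and sigma = softplus(rho)."""
    sigma = softplus(rho)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

def predictive_mean_and_std(x, mu, rho, n_samples=1000, seed=0):
    """Run a one-weight 'network' y = w * x many times with sampled weights."""
    rng = random.Random(seed)
    outputs = [sample_weight(mu, rho, rng) * x for _ in range(n_samples)]
    mean = sum(outputs) / n_samples
    var = sum((o - mean) ** 2 for o in outputs) / n_samples
    return mean, math.sqrt(var)

confident = predictive_mean_and_std(2.0, mu=0.5, rho=-3.0)
uncertain = predictive_mean_and_std(2.0, mu=0.5, rho=0.0)
print("narrow posterior:", confident)
print("wide posterior:  ", uncertain)
```

The spread of the outputs is the uncertainty estimate: a wider weight distribution produces a wider range of predictions for the same input. In training, `mu` and `rho` would be learned by gradient descent, which this sketch leaves out.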
SimGNN
SimGNN is a PyTorch implementation of the neural network approach to fast graph similarity computation described in the WSDM 2019 paper "SimGNN: A Neural Network Approach to Fast Graph Similarity Computation". It addresses the computational burden of traditional measures such as Graph Edit Distance (GED) and Maximum Common Subgraph (MCS) while maintaining high performance. The tool uses a learnable embedding function to map each graph to an embedding vector that provides a global summary. A key feature is its attention mechanism, which emphasizes the nodes that matter most for a given similarity metric. SimGNN also includes a pairwise node comparison method that supplements the graph-level embeddings with fine-grained node-level information. This approach leads to better generalization on unseen graphs and runs in quadratic time in the worst case. Experimental results show smaller error rates and significant time reductions compared to existing baselines.
vaderSentiment
vaderSentiment (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool, freely available under the MIT License. It is specifically attuned to sentiments expressed in social media, making it effective for analyzing user-generated content, and it also performs well on texts from other domains. Key features include handling of negations, contractions, conventional punctuation, word-shape emphasis (e.g., ALL CAPS), degree modifiers, slang, emoticons, and initialisms/acronyms. It offers Python 3 compatibility, improved modularity, and faster performance through significantly reduced time complexity. It installs via pip and can work in conjunction with NLTK for analyzing longer texts.
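To show what "lexicon- and rule-based" means in practice, here is a toy scorer with a four-word lexicon and two rules (negation and ALL-CAPS emphasis). It is not vaderSentiment and does not use its lexicon or API: the word scores, the constants, and the squashing function are all assumptions made up for this sketch.

```python
import math

# Hypothetical valence scores; a real lexicon has thousands of rated entries.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}
NEGATIONS = {"not", "never", "no"}
CAPS_BOOST = 0.7       # assumed emphasis added to an ALL-CAPS word
NEGATION_SCALE = -0.7  # assumed flip-and-dampen factor after a negation

def toy_sentiment(text):
    """Return a score in (-1, 1); positive means positive sentiment."""
    tokens = text.split()
    # Caps only signal emphasis when the text is not entirely upper case.
    mixed_case = (any(t.isupper() for t in tokens)
                  and not all(t.isupper() for t in tokens))
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.strip(".,!?").lower())
        if valence is None:
            continue
        if mixed_case and token.isupper():
            valence += CAPS_BOOST if valence > 0 else -CAPS_BOOST
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            valence *= NEGATION_SCALE
        total += valence
    # Squash the unbounded sum into (-1, 1); the constant 15 is illustrative.
    return total / math.sqrt(total * total + 15)

for sample in ["good day", "GOOD day", "not good", "terrible service"]:
    print(f"{sample!r}: {toy_sentiment(sample):+.3f}")
```

The appeal of this design is that every score can be traced back to a lexicon entry and a rule, with no training data or model weights involved.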
modeltime
Modeltime is an open-source R package designed to simplify and accelerate high-performance time series analysis and forecasting. It integrates various time series models, including classical methods like ARIMA and ETS, with machine learning algorithms from the `tidymodels` ecosystem, and specialized models like Facebook's Prophet. This unified framework eliminates the need to switch between different tools, allowing users to leverage a wide array of techniques from a single platform. Modeltime emphasizes a streamlined workflow for forecasting, incorporating best practices and supporting advanced capabilities such as ensembling, resampling for backtesting, and scalable modeling for thousands of time series. It is part of a growing ecosystem that includes extensions for H2O AutoML and GluonTS deep learning.
Time-MoE
Time-MoE is an open-source project offering a family of decoder-only time series foundation models, utilizing a Mixture of Experts architecture. These models are designed for auto-regressive operation, enabling universal forecasting with arbitrary prediction horizons and context lengths up to 4096. It scales up to 2.4 billion parameters and is trained from scratch. A key component is the Time-300B dataset, the largest open-access time series data collection, comprising over 300 billion time points across more than nine domains. Time-MoE supports making forecasts, fine-tuning with custom datasets in jsonl format, and evaluation on benchmark datasets, making it suitable for advanced time series analysis.
ETM
ETM (Topic Modeling in Embedding Spaces) is a research tool designed to perform topic modeling by representing words and topics within a unified embedding space. This approach allows for the likelihood of a word under ETM to be modeled as a Categorical distribution, derived from the dot product between the word embedding and its assigned topic's embedding. ETM is particularly effective as a document model, capable of learning interpretable topics and word embeddings. Its design makes it robust against large vocabularies, including those with rare words and stop words, which is a significant advantage in natural language processing. The tool provides scripts for data preprocessing, training, and evaluation, supporting various datasets like 20NewsGroup and New York Times.
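The word distribution described above can be sketched as a softmax over dot products between each word's embedding and one topic's embedding. The example below is a pure-Python illustration with made-up two-dimensional vectors; it is not the ETM codebase, and the function name and embedding values are hypothetical.

```python
import math

def word_distribution(word_embeddings, topic_embedding):
    """Categorical distribution over words for one topic.

    Each word's logit is the dot product of its embedding with the topic
    embedding; a softmax turns the logits into probabilities.
    """
    logits = [sum(w * t for w, t in zip(vec, topic_embedding))
              for vec in word_embeddings.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {word: e / total for word, e in zip(word_embeddings, exps)}

# Hypothetical embeddings: the first axis loosely means "sports",
# the second "politics".
embeddings = {
    "goal": [1.0, 0.0],
    "team": [0.8, 0.1],
    "vote": [0.0, 1.0],
}
sports_topic = [2.0, 0.0]

for word, prob in word_distribution(embeddings, sports_topic).items():
    print(f"{word}: {prob:.3f}")
```

Words whose embeddings point in the same direction as the topic embedding receive most of the probability mass, which is what makes the learned topics interpretable.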
enercast
enercast is a leading technology provider specializing in weather-based artificial intelligence for the digital transformation of renewable energy. Its self-learning SaaS products deliver accurate power generation forecasts for wind and solar plants, enabling their efficient operation, ensuring grid stability, and increasing trading margins. The platform processes large amounts of weather data, combining numerical weather prediction models with site-specific measurement data to learn individual plant behavior. Founded in 2011, enercast delivers 400 million forecast data points daily to customers in 30 countries, covering 240 GW of installed capacity, supporting the emerging decentralized energy system.
caffe-cvprw15
caffe-cvprw15 is a deep learning framework developed by Kevin Lin, Huei-Fang Yang, and Chu-Song Chen for fast image retrieval. It introduces a novel approach to generate hash-like binary codes by adding a latent-attribute layer to a deep Convolutional Neural Network (CNN). This method efficiently learns domain-specific image representations and hash functions without relying on pairwise similarities, making it highly scalable for large datasets. The framework has demonstrated significant improvements in retrieval precision on datasets like MNIST and CIFAR-10, and its computational cost for Hamming distance calculation is substantially lower than traditional Euclidean distance measures, offering a speedup of approximately 982,600x. It provides resources for downloading pre-trained models and datasets, and includes scripts for training custom models.
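The reason Hamming distance is so cheap is that binary codes can be compared with a single XOR followed by a bit count. The sketch below packs layer activations into an integer code and ranks a tiny database by Hamming distance. It is not the caffe-cvprw15 code: the 0.5 threshold, the four-bit codes, and the image names are assumptions for illustration.

```python
def binarize(activations, threshold=0.5):
    """Pack activations into an integer, one bit per unit (1 if >= threshold)."""
    code = 0
    for a in activations:
        code = (code << 1) | (1 if a >= threshold else 0)
    return code

def hamming_distance(code_a, code_b):
    """Number of bit positions in which the two codes differ."""
    return bin(code_a ^ code_b).count("1")

def rank_by_hamming(query_code, database):
    """Return database keys ordered from most to least similar to the query."""
    return sorted(database,
                  key=lambda name: hamming_distance(query_code, database[name]))

# Hypothetical 4-bit codes; practical hash codes are longer.
database = {"img_a": 0b1010, "img_b": 0b0101, "img_c": 0b1000}
query = binarize([0.9, 0.1, 0.8, 0.2])

print(rank_by_hamming(query, database))
```

Each comparison touches one machine word instead of a long floating-point vector, which is where the large speedup over Euclidean distance comes from.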
Dataset Profiling
Dataset Profiling is a Hugging Face Space designed to help users analyze and understand their datasets. By uploading a dataset file, users can generate a comprehensive profile report that provides insights into data distributions and helps identify potential data quality issues. The tool is particularly useful for data scientists and machine learning engineers who need to quickly assess the characteristics of their data before further processing or model training. The generated report can be uploaded to the user's Hugging Face account, with customizable report names and versions, facilitating organized data management and collaboration.
QueryCraft
QueryCraft is an AI-powered tool designed to simplify the creation of JQL (Jira Query Language) queries. Users can input natural language descriptions of the data they are looking for, and QueryCraft will instantly generate the corresponding JQL query. This eliminates the need for manual query construction, saving time and reducing the complexity often associated with building specific Jira queries. It's ideal for anyone working with Jira who needs to efficiently retrieve data without extensive knowledge of JQL syntax, allowing them to work smarter and focus on analysis rather than query building.