Coding & Development
Browsing page 17 of AI tools for Testing & QA in Coding & Development. Sorted by confidence score — our independent quality rating.
Relyable
Relyable is a platform for automated testing and monitoring of AI voice agents. Users can generate hundreds of realistic test conversations, evaluate every call against a custom rubric, and monitor production agents live. The platform offers native integrations with Vapi, Retell, and ElevenLabs for quick setup. Users can create AI-assisted test cases from system prompts, define personas from over 200 presets, and assign them to conversation scenarios for broad coverage. For production monitoring, Relyable logs and analyzes every live call and sends alerts through channels such as Slack and PagerDuty when performance drifts, so problems can be addressed before they escalate.
DeepSeek-V4.ai
DeepSeek-V4.ai is a dedicated tracker for the anticipated DeepSeek V4, an AI model expected to be a flagship for repo-level coding, long-context reasoning, and agentic workflows. As of February 16, 2026, the model has not been officially released; the site compiles rumored specifications such as a ~1-trillion-parameter MoE architecture, a 1M+ token context window, and new memory core technologies. It also presents unverified leaked benchmarks, including claimed SWE-Bench and HumanEval results. The site discusses the potential impact of DeepSeek V4 on cost pressure, repo-scale agent workflows, and local/self-hosted adoption, and is aimed at developers and AI researchers awaiting the launch.
Hexometer
Hexometer is an AI-powered monitoring tool that checks websites and key services around the clock. It detects downtime, user experience problems, performance bottlenecks, broken pages, and errors. It also monitors SEO, security vulnerabilities, and server configuration, and sends alerts before these issues affect the business. Visual, content, and technology monitoring keep users informed of changes on their web pages. Hexometer also includes tools for meta tag analysis, domain WHOIS lookup, broken link checking, and page speed scanning.
CADY
CADY is an AI-powered platform for fast, accurate electrical schematic analysis, intended to reduce design review time and improve engineering precision. It uses proprietary algorithms to automatically read and interpret electrical component datasheets, offering a more comprehensive analysis than traditional Design Rule Check (DRC) tools. The platform supports leading CAD software, including Siemens, Cadence, Altium, Zuken, KiCad, and Eagle, and accepts various netlist formats. Users upload their netlist and BOM files for an automated analysis, with no manual data input or component libraries required. CADY identifies a wide range of errors, from connection integrity to voltage mapping and communication protocols, and provides an interactive HTML report for review. For data security, the system does not permanently store user files and employs multi-layered cyber protection.
Secuarden
Secuarden AI offers a governance fabric for AI-generated code, addressing the security and compliance challenges that such code introduces. It functions as a static code analysis tool, examining code for vulnerabilities and turning the resulting security findings into actionable compliance evidence. The platform automates SDLC (Software Development Life Cycle) intelligence to support adherence to industry standards such as SOC 2 and PCI-DSS. It is aimed at organizations that use AI in their coding processes and need oversight and documentation to maintain their security posture and meet regulatory requirements.
MOSTLY AI
MOSTLY AI is a data intelligence platform designed to unlock the power of data through secure access, high-quality synthetic data generation, and seamless analysis. The platform features an AI Assistant for persistent data analysis and collaboration, allowing users to gain insights from live production data using natural language. It supports the creation of realistic mock data for safe experimentation and testing, and generates high-fidelity, privacy-safe synthetic datasets that mimic real data without exposing sensitive information. Additionally, MOSTLY AI enables the simulation of edge cases and future scenarios for stress testing strategies. The platform is built for individuals, teams, and enterprise organizations, offering scalable deployment options and an open-source Synthetic Data SDK for local data generation.
Rawbot
Rawbot is a platform for comparing and evaluating AI models. It helps users identify the models best suited to their research, development, or business needs by providing insights into the strengths and weaknesses of different models. The goal is a streamlined model selection process that supports informed decisions for anyone working with artificial intelligence.
exploraNote
exploraNote is an AI-powered tool designed to streamline the process of exploratory testing, note-taking, and report generation for software testers. It helps testers efficiently document their findings during testing sessions, automating aspects of note-taking and report creation. The tool aims to enhance the overall testing workflow by providing intelligent assistance in capturing observations and generating comprehensive reports, thereby improving the speed and accuracy of test documentation.
QuantPi
QuantPi offers a comprehensive AI testing platform designed to evaluate AI systems under real-world conditions, ensuring compliance and mitigating risks before deployment. The platform tests various AI types, including agentic AI, GenAI, computer vision, physical AI, voice, video, and multimodal systems, using a consistent methodology. Key capabilities include statistical certainty for risk quantification with confidence intervals, agent-level simulation without user risk, and a single engine for all models and modalities. QuantPi integrates into existing AI development cycles, providing audit-ready evidence for every release. It supports flexible deployment options, including public cloud and on-premises for sovereign AI, with enterprise-grade security features like SSO and role-based access control. The platform is built to test current and future AI systems, making it a long-term investment for AI-first enterprises.
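To make "risk quantification with confidence intervals" concrete, here is a minimal sketch that computes a Wilson score interval for a failure rate observed over a batch of test runs. This is a generic statistical technique, not QuantPi's documented method; the function name `wilson_interval` and the sample counts are hypothetical.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must be between 0 and trials")
    p = failures / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical batch: 12 failing test runs out of 400.
low, high = wilson_interval(failures=12, trials=400)
print(f"observed failure rate: {12 / 400:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

Reporting the interval rather than the single observed rate shows how much the estimate could move with more test runs.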
dev3000
dev3000 is a debugging assistant that captures a comprehensive timeline of a web application during development. It monitors and records server logs, browser console messages and errors, network requests and responses, and user interactions. It also takes automatic screenshots during navigation, errors, and interactions. All of this data is organized into timestamped logs that AI assistants can read, giving them full context to identify issues and suggest fixes. The tool supports web applications written in JavaScript/TypeScript, Python, and Ruby, and can work with any AI assistant capable of reading files, such as Claude or OpenAI Codex. It offers diagnostic commands for error analysis, log viewing, and application crawling.
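The idea of a unified, timestamped timeline can be sketched as a merge of per-source event streams. The example below is an illustration only: the log format, the sample events, and the `parse` helper are hypothetical and do not reflect dev3000's actual output.

```python
import heapq
from datetime import datetime


def parse(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp so events sort chronologically."""
    return datetime.fromisoformat(timestamp)


# Hypothetical events; each source is already sorted by time.
server_events = [
    ("2026-01-01T10:00:00.120", "server", "POST /api/cart 500"),
    ("2026-01-01T10:00:02.000", "server", "GET /health 200"),
]
browser_events = [
    ("2026-01-01T10:00:00.050", "browser", "click #add-to-cart"),
    ("2026-01-01T10:00:00.180", "browser", "console.error: Failed to add item"),
]

# heapq.merge interleaves already-sorted inputs without re-sorting everything.
timeline = list(heapq.merge(server_events, browser_events, key=lambda e: parse(e[0])))

lines = [f"[{ts}] [{source}] {message}" for ts, source, message in timeline]
print("\n".join(lines))
```

Interleaving the sources this way places a user action, the failing request, and the resulting console error next to each other, which is the context an assistant needs to trace cause and effect.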
deepdrive
Deepdrive is an open-source simulator designed to facilitate experimentation and advancement in self-driving AI. It enables anyone with a PC to develop and test state-of-the-art autonomous driving systems within a realistic simulated environment. The simulator supports various AI agent types, including forward-agents, remote agents, and baseline agents like Mnet2 and C++ FSM/PID. Users can record training data for imitation learning, convert data to TFRecords, and train models using provided datasets or their own. Deepdrive offers detailed observation data, including vehicle dynamics, camera feeds (image, depth), and environmental information, all adhering to Unreal Engine conventions for units and rotations. It requires Linux, Python 3.6+, 10GB disk space, and 8GB RAM, with optional GPU requirements for baseline agents.
mteb
mteb (Massive Text Embedding Benchmark) is an open-source Python library designed for comprehensive evaluation of text and multimodal embeddings. It offers a standardized framework to benchmark the performance of different embedding models across a wide array of tasks, including classification, clustering, semantic textual similarity (STS), retrieval, and reranking. The tool supports both monolingual and multilingual evaluations, with a focus on reproducibility and ease of use. Developers and researchers can use mteb to select models, define custom models, run evaluations, and analyze results, contributing to an interactive leaderboard that tracks the state-of-the-art in embedding performance. Its modular design allows for easy integration of new models, datasets, and benchmarks.
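As an illustration of how an STS-style task can be scored, the sketch below computes cosine similarity for each sentence pair's embeddings and then the Spearman rank correlation against human similarity ratings, a common way to score such tasks. It does not use mteb's API; the toy vectors, gold scores, and helper names are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy embeddings for three sentence pairs (hypothetical values).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-paraphrase
    ([1.0, 0.0], [0.6, 0.8]),  # loosely related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
gold = [4.8, 2.5, 0.3]  # hypothetical human ratings on a 0-5 scale

predicted = [cosine(a, b) for a, b in pairs]
score = spearman(predicted, gold)
print(f"Spearman correlation: {score:.3f}")
```

Rank correlation is used here because only the ordering of pairs matters: a model is not penalized for producing similarities on a different scale than the human ratings.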
evaluation-guidebook
The Hugging Face Evaluation Guidebook is a comprehensive resource for understanding and implementing Large Language Model (LLM) evaluation. It provides both practical insights and theoretical knowledge, drawing on the experience of managing the Open LLM Leaderboard and designing the lighteval framework. The guidebook covers automatic benchmarks, human evaluation, and LLM-as-a-judge approaches. It offers guidance on designing custom evaluations and troubleshooting common issues, along with tips and tricks for both beginner and advanced users. It also includes sections on general LLM knowledge, such as model inference and tokenization.
harbor
Harbor is a robust, open-source framework designed for the evaluation and optimization of AI agents and language models. Developed by the creators of Terminal-Bench, it provides a comprehensive toolkit for assessing agent performance, including those like Claude Code and OpenHands. Users can leverage Harbor to create and share custom benchmarks and environments, facilitating diverse experimental setups. The framework supports parallel execution of experiments across thousands of environments, utilizing providers such as Daytona and Modal, and can generate rollouts for reinforcement learning optimization. Its flexibility makes it suitable for a wide range of AI development and research tasks.
IUNA AI
IUNA AI offers AI vision systems for industrial manufacturing, particularly automotive applications. Its flagship products, the Weld Inspector and Assembly Inspector, replace subjective manual checks with automated AI inspection, enabling 100% inline inspection. The Weld Inspector detects defects in various weld types (steel, aluminum, laser) in compliance with ISO standards, helping to ensure structural integrity. The Assembly Inspector provides quality assurance for body shop and assembly, including precision metrology for gap and flush, hole positions, and angles, as well as assembly verification to catch missing parts such as bolts, clips, and nuts. These turnkey systems include camera, lighting, and AI computing units, built on high-resolution industrial cameras and NVIDIA chip technology.
TransformerLens
TransformerLens is an open-source Python library for the mechanistic interpretability of GPT-2-style language models. Created by Neel Nanda and maintained by Bryce Meyer, it lets users load over 50 different open-source language models and expose their internal activations. Researchers can cache any internal activation and add functions to edit, remove, or replace these activations during model execution. The library supports in-depth analysis aimed at reverse engineering the algorithms that models learn from their weights. It also includes experimental support for Mamba / SSM architectures, providing bridge adapters for Mamba-1 and Mamba-2.
transformers-interpret
transformers-interpret is a model explainability tool specifically designed to integrate seamlessly with the Hugging Face Transformers package. It enables developers to understand the predictions of their transformer models with minimal effort, requiring only two lines of code to generate explanations. The tool supports explainers for both text and computer vision models, offering insights into how different parts of the input contribute to the model's output. It also provides visualization capabilities, allowing users to view attributions directly in notebooks or save them as PNG and HTML files for easier analysis and sharing. This functionality is crucial for debugging, improving model performance, and ensuring transparency in AI applications.
yet-another-applied-llm-benchmark
Yet Another Applied LLM Benchmark is an open-source tool designed to evaluate the performance of large language models (LLMs) on practical, real-world tasks. The benchmark is unique because its tests are derived directly from questions the creator has previously asked LLMs to solve, covering scenarios like converting Python to C, decompiling bytecode, explaining minified JavaScript, and generating SQL queries. It features a simple dataflow domain-specific language (DSL) that allows users to easily create new, sophisticated test cases. This DSL enables complex evaluation flows, such as asking an LLM to generate code, running that code in a Docker container, and then using another LLM to evaluate its output. The tool emphasizes testing models on tasks that developers genuinely care about, providing a more realistic assessment than many academic benchmarks.
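The dataflow idea described above can be sketched with Python's `>>` operator. This is an illustration of the pattern only, not the benchmark's own code: the `Stage` class, `fake_llm`, `run_python`, and `contains` are hypothetical names, the model call is a stub, a plain substring check stands in for the second LLM judge, and `exec` stands in for the Docker container (it is not a sandbox and is only safe here because the code string is fixed).

```python
import contextlib
import io
from typing import Callable


class Stage:
    """One pipeline step; `a >> b` feeds the output of a into b."""

    def __init__(self, fn: Callable):
        self.fn = fn

    def __call__(self, value=None):
        return self.fn(value)

    def __rshift__(self, other: "Stage") -> "Stage":
        return Stage(lambda value: other(self(value)))

    def __rrshift__(self, literal) -> "Stage":
        # Lets a plain string start the chain: "prompt" >> stage
        return Stage(lambda _=None: self(literal))


def fake_llm(prompt: str) -> str:
    """Stub for a model call; always returns the same program."""
    return 'print("hello world")'


def run_python(code: str) -> str:
    """Run code and capture stdout (stand-in for a Docker sandbox)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


def contains(expected: str) -> Stage:
    """Evaluator stage: passes if the expected text appears in the output."""
    return Stage(lambda output: expected in output)


pipeline = (
    "Write hello world in Python"
    >> Stage(fake_llm)
    >> Stage(run_python)
    >> contains("hello world")
)
passed = pipeline()
print("test passed:", passed)
```

Because each test is a chain of small stages, a new case can swap in a different runner or evaluator without changing the rest of the chain.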
neuralminds.io
Neural Minds offers an AI-powered platform designed for self-assessment and evaluation, leveraging Generative and Predictive AI (GAP AI). Users can assess their skills and knowledge, gaining insights into their strengths. The platform facilitates team formation for projects by matching individuals based on their assessed skills. This approach helps users 'up their game' through continuous evaluation and strategic team building, ensuring projects align closely with team members' capabilities.
Arklex AI
Arklex AI provides a simulation-based evaluation platform for AI agents, enabling teams to generate realistic multi-turn conversations with synthetic users. This approach allows for the evaluation of every turn, identifying failure modes like context loss, tool misuse, and policy violations that often emerge only in complex interactions. Unlike other tools that require pre-existing datasets, Arklex generates test data, covering edge cases where users push back or change their minds. It supports any agent or framework that exposes an HTTP endpoint, speaks the A2A protocol, or is a Python class. Arklex integrates into development workflows as a CI/CD quality gate and a standalone platform for testing, governance, and deployment approval, ensuring agents meet readiness standards before production.
Djrango Qwen2vl Flux
Djrango Qwen2vl Flux is a Hugging Face Space designed for text-to-image generation. Users can enter a text description, and the application will generate a corresponding image. This tool is ideal for visualizing creative ideas, prototyping designs, or simply generating unique art pieces from textual prompts. It leverages the Qwen2vl model and is built with Gradio, providing an interactive interface for experimentation. The platform is hosted on Hugging Face, making it accessible for testing and exploring the capabilities of AI-driven image generation.
Mechanika Engineering
Mechanika Engineering acts as an R&D partner for manufacturing challenges that lack off-the-shelf solutions. They combine ingenious, compact-footprint design with advanced machine vision, developed in their own lab, to solve complex automation problems within existing factory constraints. Their offerings include turn-key machine-vision cells like DrillEye for inline hole inspection, Robo Eye for multi-camera quality control, and Depal Vision+ for intelligent depalletizing guidance. They also provide R&D services, custom machinery engineering, and comprehensive service and support, including remote diagnostics and spare-parts deliveries. Their solutions are designed for fast payback and minimal disruption across industries like wood, packaging, and metal.
Testbook
Testbook.ai is an AI-powered, no-code testing platform for web applications. It offers automated regression testing, UI comparison, and cross-browser compatibility testing. Users can record interactions with their web applications and play them back instantly, eliminating the need for manual scripting. The platform integrates with popular testing clouds such as Sauce Labs, BrowserStack, TestingBot, and LambdaTest, extending testing across various environments. Testbook.ai also includes AI-powered UI testing for detecting visual discrepancies, detailed reports, and a self-healing mechanism for stable test scripts. It supports hybrid testing with a manual stepper and offers Jira integration for bug tracking.
Video-MME
Video-MME is presented as the first comprehensive evaluation benchmark for assessing the capabilities of Multi-modal Large Language Models (MLLMs) in video analysis. It covers a wide range of visual domains, temporal durations, and data modalities, including short, medium, and long videos (from 11 seconds to 1 hour). The benchmark comprises 900 videos totaling 254 hours and 2,700 human-annotated question-answer pairs. It integrates multi-modal inputs beyond video frames, such as subtitles and audio, to provide a full-spectrum evaluation. Video-MME is suitable for both image MLLMs and video MLLMs, offering a framework for evaluating how well models understand and process sequential visual data.
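The headline figures above imply some per-video averages, worked out in the short sketch below. Only the three input numbers come from the description; the derived averages are simple arithmetic and say nothing about how durations or questions are actually distributed across videos.

```python
# Figures quoted in the description.
videos = 900
total_hours = 254
qa_pairs = 2700

# Derived averages (simple division, not official statistics).
avg_minutes_per_video = total_hours * 60 / videos
qa_per_video = qa_pairs / videos

print(f"average video length: {avg_minutes_per_video:.1f} minutes")
print(f"question-answer pairs per video: {qa_per_video:.0f}")
```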