Coding & Development
Browsing page 12 of AI tools for Testing & QA in Coding & Development. Sorted by confidence score — our independent quality rating.
Adadot
Adadot is an AI-powered developer analytics platform designed to improve developer performance and well-being through data-driven insights. It benchmarks every data point against 50,000 active developer datasets and applies four layers of statistical analysis to ensure robustness. The tool helps engineering leaders understand the full impact of their decisions, while also providing developers with a "fitness tracker" to build trust and autonomy. Adadot pulls data from communication channels to assess collaboration health and engineering sustainability, helping to quantify the true cost of initiatives and identify areas for effort investment. It also offers "What if" scenario analysis to manage board expectations and protect developers.
Deontic
Deontic helps mobility companies navigate regulations with AI-based software. It provides AI-driven engineering tools for the next generation of self-driving technology, focusing on safe autonomy at scale. The platform delivers generative AI workflows and intelligent agents to accelerate validation and verification for advanced driver-assistance systems (ADAS) and autonomous driving. Deontic transforms manual scenario development into rapid, scalable simulations, saving time, cost, and effort for mobility innovators. Key features include agentic verification & validation, generative AI workflows for operational design domain (ODD) scoping and scenario generation, simulation-ready scenarios from natural language, integrated validation & QA, and cross-ODD coverage for various conditions.
Arize Phoenix
Arize Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting of LLM applications. It offers robust tracing capabilities using OpenTelemetry-based instrumentation, allowing users to monitor their LLM application's runtime. The platform also facilitates performance benchmarking through LLM-powered response and retrieval evaluations. Users can create versioned datasets for experimentation, evaluation, and fine-tuning, and track changes to prompts, LLMs, and retrieval. Phoenix includes a playground for optimizing prompts and comparing models, alongside prompt management features for systematic testing. It is vendor and language agnostic, with out-of-the-box support for popular frameworks and LLM providers, and can be deployed in various environments.
bert_score
BERTScore is an automatic evaluation metric for text generation that uses pre-trained contextual embeddings from BERT to compare candidate and reference sentences. It calculates precision, recall, and F1 scores by matching each token to its most similar token in the other sentence by cosine similarity, offering a robust method for assessing the quality of generated text. The tool supports approximately 130 models, with `microsoft/deberta-xlarge-mnli` currently offering the best correlation with human evaluation. It is compatible with Hugging Face's transformers library and provides both a Python function and a command-line interface for ease of use. BERTScore also supports multiple reference sentences and offers options for rescaling scores with baselines and using inverse document frequency (idf) to weight word importance.
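To make the precision, recall, and F1 computation concrete, here is a minimal sketch of greedy cosine-similarity matching. It is an illustration only, not the bert_score package's code: the 2-D vectors are made-up stand-ins for contextual embeddings, the helper names `cosine` and `greedy_match_scores` are hypothetical, and idf weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy token matching.

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. F1 is their harmonic mean.
    """
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "embeddings" (hypothetical values, one vector per token).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

In this toy case every candidate token has an exact match in the reference, so precision is 1.0, while the third reference token is only partially covered, which pulls recall below 1.0.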
claude-code-workflows
claude-code-workflows provides a curated collection of workflows and configurations designed for heavy users of Claude Code. It features an automated code review system, inspired by Anthropic's development process, that handles syntax, completeness, style, and bug detection. Additionally, it includes an automated security review system based on OWASP Top 10 standards to identify vulnerabilities and exposed secrets, offering severity-classified findings and remediation guidance. A design review system is also available, utilizing Microsoft's Playwright for UI/UX consistency and accessibility compliance. These workflows are based on applied learnings from an AI-native startup and are detailed with tutorials and demos on Patrick Ellis' YouTube channel, aiming to free up development teams for more strategic tasks.
smart-ide
smart-ide is an open-source AI code assistant designed as a VS Code extension, integrating ChatGPT capabilities to enhance the development workflow. It offers a suite of intelligent features including code review, automated unit test generation, error detection, and code optimization. Developers can also use smart-ide to add type definitions, generate documentation, explain code, refactor code, and perform language translation directly within their IDE. This tool is built to streamline various coding tasks, making the development process more efficient and intelligent for users.
The LLM Data Company
The LLM Data Company specializes in training frontier models for critical domains, with a current emphasis on medical applications. Their approach addresses the limitations of generalist models by developing specialized intelligence for areas where handling ambiguity, resisting sycophancy, and robust verification are paramount. They are currently developing the Kos series of medical models, with Kos-1 Lite achieving state-of-the-art (SOTA) performance on HealthBench Hard. The company focuses on post-training curricula to ensure models handle complex, sensitive data effectively, distinguishing itself from models optimized for coding or general tool-use.
AcceptMyApp
AcceptMyApp is an AI-powered assistant designed for iOS developers to streamline the app submission process. It meticulously analyzes your app's metadata against Apple's stringent Review Guidelines, proactively identifying potential rejection risks before you submit your build. This pre-check functionality helps developers avoid costly delays and rework. In cases where an app is rejected, AcceptMyApp provides clear insights into why Apple flagged the build and assists in generating reviewer-safe appeal replies, offering a clear path to fix, appeal, or submit with confidence. The tool leverages AI to provide comprehensive analysis and support throughout the app review lifecycle.
KernelBench
KernelBench is an open-source benchmark and toolkit designed to evaluate the capability of large language models (LLMs) in generating efficient GPU kernels. It specifically tasks LLMs with transpiling PyTorch operators into optimized kernels, written in CUDA or another DSL, for target GPUs. The benchmark organizes its problems into four levels, ranging from single-kernel operators to full model architectures, allowing for comprehensive evaluation. KernelBench provides core functionality for checking correctness and measuring performance against reference PyTorch operators, using a metric called `fast_p` to measure the fraction of tasks whose generated kernels are both correct and exceed a specified speedup threshold `p`. It supports various GPU programming languages and DSLs, including CUDA, Triton, and HIP for AMD GPUs, and offers flexible setup options for local or cloud-based evaluation.
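The `fast_p` metric described above can be sketched in a few lines. This is an illustrative sketch, not KernelBench's own code: it assumes the metric is the fraction of all tasks whose kernel is correct and whose speedup over the PyTorch reference is strictly greater than `p`, and the function signature and the sample results are hypothetical.

```python
def fast_p(results, p):
    """Fraction of tasks that are correct AND faster than threshold p.

    results: list of (is_correct, speedup) pairs, one per task, where
    speedup = reference_time / generated_kernel_time (assumed definition).
    """
    if not results:
        raise ValueError("results must contain at least one task")
    hits = sum(
        1 for is_correct, speedup in results if is_correct and speedup > p
    )
    return hits / len(results)


# Hypothetical results for four tasks.
results = [
    (True, 1.8),   # correct and faster than the reference
    (True, 0.9),   # correct but slower than the reference
    (False, 3.0),  # fast but incorrect, so it never counts
    (True, 1.0),   # correct, exactly matches the reference speed
]

fast_0 = fast_p(results, 0.0)  # reduces to the correctness rate
fast_1 = fast_p(results, 1.0)  # correct and strictly faster than PyTorch
print(fast_0, fast_1)
```

Under these assumptions, raising `p` makes the metric stricter: `p = 0` counts every correct kernel, while `p = 1` counts only correct kernels that beat the reference.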
ANTICIPATE
ANTICIPATE offers an AI-based visual quality control system designed to automate and digitalize inspection processes in both manual and machine-based manufacturing. The system integrates intelligent camera systems and screens into existing assembly, packaging, and testing stations, guiding workers with precise instructions and verifying work results directly within the production process. For automated lines, it seamlessly integrates advanced camera systems and sensors into machinery and conveyor belts, enabling automated product quality inspection and comprehensive data collection for production analysis. ANTICIPATE addresses common challenges of manual inspection, such as high error rates, low inspection speed, and lack of documentation, while overcoming limitations of classic image processing systems like complex interfaces, high pseudo-reject rates, and poor scalability. The solution provides consistent, traceable inspection results, creating a reliable data foundation for root-cause analyses and process improvement. It is GDPR-compliant and can be deployed locally to ensure data security.
Canary
Canary functions as an AI QA engineer, designed to integrate seamlessly into development workflows. It automatically analyzes code diffs in pull requests, understands the intent of changes, and generates comprehensive tests. These tests are then executed in real browsers, with live executions and results dropped directly into the PR comments. Canary provides detailed reports of passed and failed tests, including video recordings for every failure, allowing developers to quickly identify and address issues. It supports on-demand testing directly from PR comments and is built to help developers, QA engineers, and product managers ensure bug-free products, eliminating the need for brittle scripts or manual QA.
runx
runx is an open-source deep learning experiment management tool designed to automate common tasks in AI research. It facilitates hyperparameter sweeps, logging (including TensorBoard integration), and robust checkpoint management. The tool also provides experiment summarization capabilities with `sumx` and ensures code checkpointing for reproducibility. It automatically creates unique, per-run directories to prevent data overwrites and allows for easy submission of batch jobs to a farm. While the project is no longer maintained and contains security vulnerabilities, it offers a foundational approach to managing complex deep learning experiments.
uptrain
UpTrain is an open-source unified platform designed to evaluate and improve Generative AI applications. It offers over 20 preconfigured evaluations covering language, code, and embedding use cases, helping developers assess aspects like response completeness, factual accuracy, and context conciseness. The platform includes a web-based dashboard that runs locally, ensuring data privacy by keeping evaluations on your system. UpTrain also performs root cause analysis on failure cases, providing insights to resolve issues. It supports various LLM providers and embedding models, allowing for extensive customization of evaluations and the creation of custom evaluators. Developers can integrate UpTrain evaluations programmatically using its Python package.
WindowsAgentArena
WindowsAgentArena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment, enabling the testing of agentic AI workflows across a diverse range of tasks. WAA supports the deployment of agents at scale using Azure ML cloud infrastructure, allowing for parallel execution of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes. The platform includes features like a new difficulty mode for tasks, the Navi agent with Omniparser, and the open-sourced Omniparser screen understanding model. Users can deploy locally using Docker and WSL 2, or leverage Azure for parallel benchmarking.
Keywords AI (YC W24)
Respan, formerly Keywords AI, is an LLM engineering platform designed to streamline the development and deployment of reliable AI applications. It offers a comprehensive suite of features including LLM observability, automated evaluations (evals), prompt optimization, and a unified LLM gateway. The platform allows developers to trace, log, and evaluate agent behavior, identify failures, and understand the impact of prompt or model changes. Respan supports over 500 models and integrates with popular frameworks like OpenAI, Anthropic, LangChain, and LlamaIndex, enabling teams to monitor, debug, and improve their AI systems efficiently. It is built to add observability without becoming a performance bottleneck, making it suitable for production use.
PreFab Photonics
PreFab Photonics offers an AI-powered virtual nanofabrication platform that simulates photonic chip fabrication with foundry-accurate process models. It predicts lithographic effects and process variation before tape-out, helping to eliminate design-manufacturing iteration loops. The platform allows users to integrate PreFab into existing Python workflows for manufacturing predictions in seconds, or design visually using Rosette, a browser-based photonic layout editor with built-in virtual nanofabrication. Beyond prediction, PreFab enables fabrication-aware optimization through differentiable models, allowing for inverse design that accounts for manufacturing constraints and optimizes for post-fab outcomes. It also helps in pre-compensating designs to match target specifications and provides insights into potential yield and robustness by highlighting uncertainty in predictions.
ZeroThreat
ZeroThreat is an AI-powered pentest tool designed to secure web applications and APIs through automated scanning and continuous penetration testing. It ensures compliance and provides actionable remediation insights, operating at 'dev speed' to support AI-generated code without slowing down development teams. The platform offers fast, automated security testing with 98.9% accuracy, scanning 5x faster than traditional DAST tools. It can re-scan single issues instantly and includes built-in API scanning. ZeroThreat scans for over 130,000 vulnerabilities, including OWASP Top 10, known CVEs, and business logic issues, and supports authenticated scans for areas behind login. It also assists with compliance needs like HIPAA, PCI, ISO 27001, and GDPR, providing audit-ready reports.
Etiq AI
Etiq AI is a reskilling copilot designed for enterprises to upskill and redeploy talent for AI without disruptive restructures. It provides applied learning in a user's environment, focusing on practical AI literacy for business-unit and IT professionals. The platform measures capability through verified assessments and skills matrices, allowing organizations to redeploy people into higher-value roles while retaining institutional knowledge. Etiq AI helps teams collaborate with AI, specify problems, interpret outputs, and operationalize AI safely, addressing the challenge of AI adoption in business units. It offers guardrails for safe learning, manager dashboards, and audit-ready trails, making reskilling practical and measurable.
prompttools
prompttools, created by Hegel AI, is an open-source, self-hostable toolkit designed for experimenting with, testing, and evaluating large language models (LLMs), vector databases, and prompts. It enables developers to test prompts and parameters across various models, including OpenAI, Anthropic, and LLaMA, and to assess the retrieval accuracy of vector databases. The tool offers evaluation through code, notebooks, and a local playground interface. It supports a wide range of integrations for LLMs like OpenAI, LLaMA.Cpp, HuggingFace, Anthropic, Mistral AI, Google Gemini, and Google PaLM, as well as vector databases such as Chroma, Weaviate, Qdrant, LanceDB, Milvus, Pinecone, and Epsilla. Users can persist results by exporting experiments to CSV, JSON, or MongoDB.
rubberduck-vscode
rubberduck-vscode is an open-source Visual Studio Code extension designed to enhance the developer experience with AI-powered code assistance. Leveraging the OpenAI API, it offers functionalities such as intelligent code editing, detailed explanations of code snippets, and efficient code generation directly within the VS Code environment. The tool also aids in error diagnosis and bug finding through an interactive AI chat interface, streamlining the debugging process. Its integration into a popular IDE makes it a convenient solution for developers looking to accelerate their coding workflows and improve code quality with AI.
Bugster
Bugster offers a suite of AI-powered tools designed to streamline the testing process. It provides end-to-end testing agents for developers, aiming for comprehensive coverage and efficiency. Beyond traditional testing, Bugster includes 'sow' for database branches tailored for AI agents, 'flick' for session replay bug detection, and 'monitor' for visual regression monitoring on production sites. These features collectively aim to help development and QA teams ship code with confidence by catching bugs early and adapting to UI changes without extensive maintenance.
CoTester by TestGrid
CoTester by TestGrid is an enterprise-grade AI software testing agent designed to create, run, and maintain self-healing test cases for complex enterprise applications. It acts as an always-available teammate, learning product context and adapting to QA workflows to write test code. Key features include instant test creation from JIRA stories, AI-powered auto-healing (AgentRx) for UI changes, seamless execution across real browsers, and AI testing with guardrails that pause for team validation. CoTester utilizes a multi-modal Vision Language Model (VLM) to interpret app screens like a human tester, automatically identifying and logging bugs. It supports scheduled test execution and adaptive learning, becoming smarter with each use. The tool offers no-code, low-code, and pro-code modes, allowing product managers, business analysts, manual testers, and automation engineers to generate and manage test cases effectively. It also provides private cloud/on-prem support, enterprise integrations, and full code ownership.
auto-playwright
auto-playwright is an open-source library designed to automate Playwright test steps using ChatGPT. It enables developers to write tests using plain-text instructions, eliminating the need for precise selectors and reducing the strong coupling with application markup often found in conventional testing. This approach facilitates rapid test creation and allows for a Test-Driven Development (TDD) workflow, where tests can be written concurrently with or even before functionality development. The tool supports various Playwright actions like clicking, filling, querying data, and asserting states, and can be configured with OpenAI or Azure OpenAI. While free to use, it incurs costs associated with OpenAI API usage, estimated at around $0.01 per test step with GPT-4 Turbo.
Automotive Artificial Intelligence (AAI) GmbH
Automotive Artificial Intelligence (AAI) GmbH, which is transitioning to the name AAI Innovations GmbH, offers TÜV-certified tools designed for ADAS and automated driving. Their product suite includes RepliMap for creating and editing ASAM OpenDRIVE-compliant road networks and 3D scenes for simulation, CORA (Compliance & Regulatory Assistant), which translates complex automotive regulations into actionable intelligence, and SGAF (Safety Guidance and Analytics Framework) for building and analyzing safety cases. As part of this transition, the company is expanding its focus beyond automotive to broader data- and AI-driven solutions, while maintaining its commitment to trust in autonomous technology.