Coding & Development

Browsing page 13 of AI tools for Testing & QA in Coding & Development. Sorted by confidence score — our independent quality rating.

Boxy (Verified)

61%

Boxy, developed by CodeSandbox, was an AI coding assistant designed to accelerate development through contextual explanations, code generation, and refactoring. Because it ran within the CodeSandbox cloud infrastructure, Boxy had access to the entire codebase and could use full project context. Key features included code refactoring directly from the app preview, contextual code generation tailored to specific needs, and automatic commit message suggestions. It also aimed to make learning more accessible by providing explanations and insights into code. Boxy was available to CodeSandbox Pro subscribers and was deprecated in July 2024, with AI features now available through Codeium.

Playrun

61%

Playrun is an AI-powered testing tool designed to automatically generate end-to-end tests for web applications. It allows users to run these tests periodically and receive alerts when issues are detected, ensuring that bugs are caught proactively. The platform emphasizes a 'no code required' approach, making it accessible for users to set up testing simply by providing their application's URL. This streamlines the QA process, helping development teams maintain application stability and deliver a better user experience by identifying and addressing problems before they impact end-users.

Bennu AI

61%

Bennu AI offers an autonomous AI agent designed to manage operations, deploy code, fix bugs, and maintain system uptime, allowing teams to focus on development. It provides zero-downtime monitoring, detecting crashes, restarting services, and archiving logs before users are impacted. The platform automates CI/CD processes, handling everything from Docker to production with minimal configuration. Bennu AI also integrates robust security features, scanning for misconfigurations, secrets, and access rights, blocking unsafe deployments in real-time. Users can deploy applications with a single prompt, describing their app in plain English for the AI to build, provision, and ship. It connects with existing stacks like GitHub, Docker, Kubernetes, and Terraform, orchestrating infrastructure, code, and operations with precision.

BugRaptors

61%

BugRaptors is a software testing company that provides AI-powered quality engineering services and top-notch QA solutions. The company leverages advanced AI, automation, and expert knowledge to deliver fast, reliable, and high-quality software testing. BugRaptors offers a range of services including manual testing, automation testing, performance testing, security testing, web testing, mobile testing, and AI testing solutions. They utilize proprietary AI-enhanced tools like RaptorAssist for test case generation, RaptorGen for test data, RaptorScan for broken links, RaptorVision for visual bugs, and RaptorSecurity for web application protection. BugRaptors serves diverse industries such as healthcare, retail, banking, energy, telecommunication, and media.

OpenSandbox

61%

OpenSandbox is a robust, open-source sandbox platform designed for AI applications, offering a secure, fast, and extensible runtime environment for AI agents. It provides multi-language SDKs in Python, Java/Kotlin, JavaScript/TypeScript, C#/.NET, and Go, along with unified sandbox APIs. The platform supports both Docker and high-performance Kubernetes runtimes, enabling local execution and large-scale distributed scheduling. OpenSandbox is ideal for scenarios such as Coding Agents, GUI Agents, Agent Evaluation, AI Code Execution, and RL Training. It features strong isolation with secure container runtimes like gVisor and Firecracker microVM, and includes built-in Command, Filesystem, and Code Interpreter implementations.

opik

61%

Opik, built by Comet, is an open-source platform designed to streamline the entire lifecycle of LLM applications, from prototype to production. It empowers developers to evaluate, test, monitor, and optimize their models and agentic systems with comprehensive tracing of LLM calls, conversation logging, and agent activity. Key features include advanced evaluation capabilities like LLM-as-a-judge for tasks such as hallucination detection and RAG assessment, experiment management, and integration into CI/CD pipelines. Opik also offers production-ready scalable monitoring dashboards, online evaluation rules, and dedicated SDKs for prompt and agent optimization, along with guardrails for safe AI practices. It supports a wide array of frameworks and offers client SDKs for Python, TypeScript, and Ruby.

prometheus-eval

61%

Prometheus-Eval is an open-source repository for evaluating Large Language Models (LLMs) on a variety of generation tasks, using evaluator models such as Prometheus and GPT-4. Recent iterations like M-Prometheus outperform previous open LLM judges on multilingual meta-evaluation benchmarks such as MM-Eval and M-RewardBench. It also offers strong performance in English, surpassing Prometheus 2 7B and 8x7B on RewardBench. Prometheus-Eval supports both absolute grading, which assigns a score from 1 to 5, and relative grading, which compares two responses. It supports local inference via vLLM and integration with LLM APIs through LiteLLM.

CeLLife Technologies Ltd.

61%

CeLLife Technologies Ltd. specializes in AI-powered diagnostics, measurement, and quality control for the battery industry. Its patented AI measurement technology, Electrical Fingerprint (EFP™), enables rapid analysis of battery cells, modules, and systems, performing diagnostics up to 900 times faster than traditional methods. This technology significantly reduces waste and costs while maximizing the potential of every battery throughout its lifecycle, from manufacturing to second life. CeLLife's solutions cater to industries such as manufacturing, Battery Energy Storage Systems (BESS), and recycling, helping businesses ensure 100% production quality, catch defects early, and improve traceability. The tool aims to build confidence in batteries, protect margins, and contribute to a world powered by sustainable energy by preventing premature degradation and failures.

qodo-cover

61%

Qodo-Cover is an AI-powered tool designed to automate test generation and enhance code coverage for software projects. It uses Generative AI models to create unit tests and streamline development workflows. The tool can be integrated into GitHub CI workflows or run locally as a CLI tool, and supports programming languages such as Python, Go, and Java. Key components include a Test Runner, Coverage Parser, Prompt Builder, and AI Caller, which check that new tests contribute to overall test effectiveness and handle interaction with LLMs for generation. It requires an OpenAI API key and a Cobertura XML code coverage report, with support for more coverage report types under active development.
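
To make the Cobertura XML requirement concrete, here is a minimal sketch of reading line coverage from such a report with Python's standard library. It is an illustration only, not Qodo-Cover's own parser: the function names `line_coverage` and `missed_lines` and the sample report are hypothetical, and the sketch assumes each `<class>` element lists its lines under a direct `<lines>` child.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written report in the Cobertura layout:
# coverage > packages > package > classes > class > lines > line
SAMPLE_REPORT = """<coverage line-rate="0.75" version="1">
  <packages>
    <package name="app">
      <classes>
        <class name="calc.py" filename="app/calc.py" line-rate="0.75">
          <lines>
            <line number="1" hits="1"/>
            <line number="2" hits="3"/>
            <line number="3" hits="0"/>
            <line number="4" hits="1"/>
          </lines>
        </class>
      </classes>
    </package>
  </packages>
</coverage>
"""


def _class_lines(cls):
    # Only class-level lines, so lines repeated under <methods> are not double counted.
    return cls.findall("./lines/line")


def line_coverage(report_xml):
    """Return the fraction of reported lines that were hit at least once."""
    root = ET.fromstring(report_xml)
    hits = [int(line.get("hits", "0"))
            for cls in root.iter("class")
            for line in _class_lines(cls)]
    if not hits:
        return 0.0
    return sum(1 for h in hits if h > 0) / len(hits)


def missed_lines(report_xml):
    """Map each source file to the line numbers that were never executed."""
    root = ET.fromstring(report_xml)
    missed = {}
    for cls in root.iter("class"):
        missed[cls.get("filename")] = [
            int(line.get("number"))
            for line in _class_lines(cls)
            if int(line.get("hits", "0")) == 0
        ]
    return missed


coverage = line_coverage(SAMPLE_REPORT)
uncovered = missed_lines(SAMPLE_REPORT)
```

A test generator could, in principle, compare this figure before and after adding a candidate test and keep only the tests that raise it.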

spec-kit

61%

spec-kit is an open-source toolkit designed to accelerate software development through Spec-Driven Development. This approach transforms specifications into executable artifacts, directly generating working implementations rather than merely guiding them. The toolkit includes the Specify CLI for installation and project initialization, supporting integrations with AI coding agents like GitHub Copilot. Users can define project principles, create detailed specifications, develop technical implementation plans, break down tasks, and execute implementations using intuitive commands. It emphasizes focusing on 'what' and 'why' rather than specific tech stacks, promoting a more structured and predictable development process. The platform also supports community extensions and presets, allowing for customization and integration with various workflows and external platforms like Azure DevOps, Jira, and Confluence.

Tshabok AI

61%

Tshabok AI is an agentic AI tool designed to automate and streamline software quality assurance processes. It specializes in generating comprehensive test cases from various sources, including existing documentation or directly from URLs. The platform's AI-powered intelligence understands context and identifies critical edge cases, ensuring extensive test coverage. Tshabok AI aims to drastically reduce the time spent on manual test case creation, offering up to 80% time savings, and boasts 95% accuracy in requirement coverage. It supports continuous updates, allowing users to easily regenerate test cases as project requirements evolve. The tool also offers integration with popular test management tools like Jira and Azure DevOps, and can export test cases in Gherkin/BDD syntax.

sql-eval

61%

sql-eval is an open-source evaluation framework designed to assess the accuracy of SQL queries generated by Large Language Models (LLMs). It operates by taking a question/query pair, generating a SQL query (potentially from an LLM), and then running both the 'gold' query and the generated query on their respective databases. The tool then compares the resulting dataframes using both 'exact' and 'subset' matching criteria. Beyond accuracy, sql-eval logs other critical metrics such as tokens used and latency, providing a comprehensive view of LLM performance. It supports various database types including Postgres, Snowflake, BigQuery, MySQL, SQLite, and SQL Server, and offers runners for popular LLM APIs like OpenAI and Anthropic, as well as local Hugging Face models and vLLM.
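
The 'exact' and 'subset' comparison can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sql-eval's actual implementation: result sets are modeled as lists of dicts rather than dataframes, and the helper names `exact_match` and `subset_match` are hypothetical. Here an exact match means the same columns and the same rows regardless of row order, while a subset match means every gold column appears in the generated result with the same row values.

```python
from collections import Counter


def _columns(rows):
    # Column names of a result set, taken from its first row.
    return sorted(rows[0].keys()) if rows else []


def _project(rows, columns):
    # Multiset of row tuples restricted to `columns`, so row order is ignored.
    return Counter(tuple(row[c] for c in columns) for row in rows)


def exact_match(gold_rows, generated_rows):
    """True when both results have identical columns and identical rows."""
    gold_cols = _columns(gold_rows)
    if gold_cols != _columns(generated_rows):
        return False
    return _project(gold_rows, gold_cols) == _project(generated_rows, gold_cols)


def subset_match(gold_rows, generated_rows):
    """True when the gold columns are all present in the generated result
    and the rows agree on those columns (extra columns are tolerated)."""
    gold_cols = _columns(gold_rows)
    if not set(gold_cols) <= set(_columns(generated_rows)):
        return False
    return _project(gold_rows, gold_cols) == _project(generated_rows, gold_cols)


# Hypothetical results of a 'gold' query and a generated query.
gold = [{"name": "a", "total": 3}, {"name": "b", "total": 5}]
generated = [
    {"name": "b", "total": 5, "id": 2},
    {"name": "a", "total": 3, "id": 1},
]

is_exact = exact_match(gold, generated)    # extra "id" column, so not exact
is_subset = subset_match(gold, generated)  # gold columns and rows are all present
```

Separating the two criteria lets a generated query that returns extra but harmless columns still count as correct under the looser check.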

Shelfmark

61%

Shelfmark offers a comprehensive inspection solution for web-based manufacturing, using AI-enabled detection algorithms for real-time defect identification. The system includes hardware optimized for the production line, configured by the Shelfmark team, and provides real-time production floor alerts, including audible and visual alarms. Users can access the platform from any device to review images, defect data, and production metrics, ensuring continuous visibility into quality performance. Shelfmark's solutions are tailored for specific industries like Textiles & Webbings, Labels & Films, Trusses, Direct-to-Film, and Coil Coating, preventing costly chargebacks and waste. The 'Managed AI' approach means Shelfmark handles hardware deployment, algorithm building, and system maintenance, eliminating the need for an internal AI team.

test-tube

61%

Test-tube is a Python library designed to streamline the logging and parallelization of hyperparameter searches for Deep Learning and Machine Learning experiments. It offers framework-agnostic compatibility, supporting popular libraries like TensorFlow, Keras, PyTorch, and Scikit-learn. Key features include the ability to log experiment hyperparameters and data, visualize results with TensorBoard, and optimize hyperparameters across multiple GPUs or CPUs. It also supports parallel hyperparameter optimization on HPC clusters using SLURM, making it suitable for large-scale research and development. The library is built on the Python argparse API, ensuring ease of use for developers.

llm-sandbox

61%

LLM Sandbox is an open-source Python library designed to securely execute code generated by Large Language Models (LLMs) within an isolated environment. It offers a lightweight and portable sandbox runtime, ensuring safety through features like isolated execution, custom security policies, resource limits (CPU, memory, time), and network isolation. The tool supports various container backends, including Docker, Kubernetes, and Podman, and provides comprehensive language support for Python, JavaScript/Node.js, Java, C++, Go, and R. It seamlessly integrates with popular LLM frameworks like LangChain and LlamaIndex, and includes advanced features such as artifact extraction, on-the-fly library management, file operations, and container pooling for performance optimization.

Parea AI (YC S23)

61%

Parea AI is an experimentation and human annotation platform designed for AI teams to confidently ship LLM applications to production. It provides comprehensive features for testing and evaluating AI systems, including experiment tracking, observability, and human feedback collection. Users can debug failures, track performance over time, and answer critical questions about model regressions or improvements. The platform also offers a prompt playground for tinkering with prompts, testing them on large datasets, and deploying effective ones. With robust logging for production and staging data, Parea AI enables online evaluations, user feedback capture, and tracking of cost, latency, and quality, making it a complete solution for the LLM development lifecycle.

Product Science

61%

Product Science provides an end-to-end orchestration platform for decentralized foundation model training. It offers a hardware-agnostic approach, allowing enterprises, research labs, and public institutions to train specialized AI models across fragmented and geo-distributed resources, from general-purpose GPUs to specialized ASICs. The platform emphasizes configurable data sovereignty and aims to overcome the barriers of centralized data centers and the scarcity of NVIDIA chips. By unifying fragmented hardware globally, Product Science unlocks elastic capacity and moves beyond traditional GPU clusters toward high-efficiency, trustless, permissionless, and resilient environments for frontier-scale AI training. They previously incubated Gonka, a decentralized network for AI training and inference.

sweep

61%

Sweep is an open-source AI coding assistant for JetBrains integrated development environments (IDEs). It runs as a plugin, offering intelligent coding suggestions and automation within the JetBrains ecosystem to streamline coding workflows. As an open-source project, Sweep encourages community contributions and gives developers a flexible way to integrate AI assistance directly into their daily coding routines.

VibeSec

61%

VibeSec is an advanced AI-powered security scanning tool designed to secure code within GitHub repositories. It leverages a combination of AI security intelligence and Semgrep to identify real security issues, secrets, insecure patterns, and known vulnerabilities. The platform provides instant, actionable reports for every scan, detailing what is wrong, why it matters, and how to fix it. VibeSec supports both public and private GitHub repositories securely using token authentication, requiring no agents or SDKs. Built for developers, it integrates security early into the development lifecycle, allowing users to scan repos, gain insights, and ship confidently without needing a dedicated security team. It also offers lightning-fast scans and an upcoming API for CI integration.

Hadrix

61%

Hadrix is an open-source, AI-powered security scanner designed to audit codebases for vulnerabilities. It operates locally on your machine, ensuring no data is stored by the tool, which enhances privacy and security. Hadrix combines static analysis with AI scanning to identify a wide range of issues, including injection, access control, authentication, secrets, logic issues, dependency risks, and misconfigurations. It supports JavaScript/TypeScript codebases and integrates with OpenAI and Anthropic models. The tool provides a detailed summary of findings, categorized by severity, and offers prioritized remediation suggestions, making it easier for developers to address critical security flaws.

AI-Codereview-Gitlab

61%

AI-Codereview-Gitlab is an automated code review tool designed to enhance code quality and development efficiency for teams using GitLab. It leverages large language models such as DeepSeek, ZhipuAI, OpenAI, Anthropic, Tongyi Qianwen, and Ollama to perform intelligent code reviews during merge requests or push operations. The tool offers instant notification delivery of review results via DingTalk, Enterprise WeChat, or Feishu. Additionally, it generates automated daily reports based on GitLab, GitHub, and Gitea commit records, providing insights into daily development progress. A visual dashboard centralizes all code review records, offering project and developer statistics. Users can also choose from various review styles, including professional, sarcastic, gentle, and humorous.

AlignX AI

61%

AlignX AI is an enterprise platform designed to ensure the reliability and predictable performance of AI agents in production environments. It applies software engineering principles, including robust testing, comprehensive observability, and strong governance frameworks, to AI systems. This approach helps businesses ship AI agents that meet compliance standards and ethical practices, mitigating risks and preventing unexpected behaviors. AlignX AI aims to streamline operations and facilitate informed decision-making by ensuring AI alignment with organizational goals, ultimately building trust and confidence in enterprise AI deployments.

Wafer

61%

Wafer is an advanced AI tool designed to deliver the fastest GPU inference in the world by autonomously profiling, diagnosing, and optimizing inference across the entire stack, from kernels to models and production pipelines. It helps developers and AI agents achieve superior performance for open-source models through flat-rate API access. For enterprises, Wafer offers tailored inference optimization for custom models, hardware, workloads, and production constraints, promising setup in less than 24 hours. The platform boasts significant speed improvements, such as being 2.8x faster than base SGLang for specific models, ensuring efficient and high-throughput AI operations.

pyllms

61%

PyLLMs is a Python library designed to simplify connections to a wide range of Large Language Models (LLMs) from providers like OpenAI, Anthropic, Google, Groq, Reka, Together, AI21, Cohere, Aleph Alpha, and HuggingfaceHub. It offers features such as multi-model support for simultaneous completions, asynchronous and streaming capabilities, and chat history management. A key differentiator is its built-in model performance benchmark, which lets users evaluate LLMs on quality, speed, and cost. The library also supports advanced configurations, including the OpenAI API on Azure, Google Vertex AI, and local Ollama models, and allows a mix of local and cloud models within the same session. Note: The project is deprecated, with pydantic-ai recommended as an alternative.
