Browsing page 16 of AI tools for Testing & QA in Coding & Development. Sorted by confidence score — our independent quality rating.
AI Quality & Testing Hub GmbH
AI Quality & Testing Hub GmbH (AIQ) offers comprehensive services for the independent evaluation, testing, and development of Artificial Intelligence (AI) systems. Founded by the State of Hesse and VDE, AIQ aims to ensure that quality is not just a promise but a verifiable reality for AI applications. The company provides AI Training to impart practical knowledge for secure and high-performing AI systems, AI Audit for neutral analyses, tests, and examinations, and AI Solution for planning and realizing AI systems. AIQ places a strong emphasis on regulatory compliance, particularly with the European AI Act, ensuring that AI innovations are trustworthy and can reliably enter the market. Their 'Quality by Design' approach integrates quality considerations from the initial stages of AI development, reducing errors, boosting efficiency, and enhancing customer experience.
evalplus
EvalPlus is a comprehensive and rigorous evaluation framework designed for Large Language Models (LLMs) that generate code. It significantly expands upon existing benchmarks, offering HumanEval+ with 80x more tests and MBPP+ with 35x more tests than their original versions, ensuring a more precise assessment of code correctness. Additionally, EvalPerf evaluates the efficiency of LLM-generated code through performance-exercising tasks and test inputs. The framework supports various LLM backends, including HuggingFace, vLLM, OpenAI-compatible servers, Anthropic, Google Gemini, Amazon Bedrock, and Ollama, allowing for flexible integration. EvalPlus enables developers and researchers to benchmark LLMs, identify fragile code generations, and understand performance beyond mere correctness, making it a critical tool for advancing code AI.
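To illustrate why a larger test suite gives a more precise correctness signal, here is a minimal sketch. It is not EvalPlus's API: the task, the function names, and the test cases are all hypothetical. A fragile solution passes a small base suite but fails once edge-case inputs are added.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[List[int], Optional[int]]

# Hypothetical task: return the largest value in a list, or None if it is empty.
def fragile_max(values: List[int]) -> Optional[int]:
    # Bug: starting from 0 gives wrong answers for all-negative or empty input.
    best = 0
    for v in values:
        if v > best:
            best = v
    return best

def passes_all(solution: Callable[[List[int]], Optional[int]],
               tests: List[TestCase]) -> bool:
    """Return True only if the solution matches the expected output on every test."""
    return all(solution(list(inp)) == expected for inp, expected in tests)

# A small base suite (hypothetical) that only covers the happy path.
BASE_TESTS: List[TestCase] = [([1, 2, 3], 3), ([5, 4], 5)]
# Extra edge cases (hypothetical) of the kind an expanded suite would add.
EXTRA_TESTS: List[TestCase] = [([-3, -1, -2], -1), ([], None)]

if __name__ == "__main__":
    print("base suite:    ", passes_all(fragile_max, BASE_TESTS))
    print("expanded suite:", passes_all(fragile_max, BASE_TESTS + EXTRA_TESTS))
```

The same solution is reported as correct or incorrect depending only on how many inputs exercise it, which is the gap that expanded benchmarks are meant to close.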
guidellm
GuideLLM is an open-source platform designed for evaluating and enhancing Large Language Model (LLM) deployments, focusing on real-world inference needs. It simulates end-to-end interactions with OpenAI-compatible and vLLM-native servers, generating workload patterns that reflect production usage. The platform produces detailed reports to help teams understand system behavior, resource needs, and operational limits. GuideLLM supports both real and synthetic multimodal datasets, including text, image, audio, and video inputs, and offers flexible execution profiles. It provides SLO-aware benchmarking, capturing complete latency and token-level statistics for metrics like TTFT, ITL, and end-to-end latency, which supports consistent performance assessment, deployment tuning, and capacity planning.
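The latency metrics named above can be sketched from per-token timestamps. This is a minimal illustration, not the tool's own implementation: TTFT is taken as time to first token and ITL as inter-token latency, and the function names and the sample timestamps are hypothetical.

```python
from typing import List

def time_to_first_token(request_start: float, token_times: List[float]) -> float:
    """TTFT: delay between sending the request and receiving the first token."""
    return token_times[0] - request_start

def inter_token_latencies(token_times: List[float]) -> List[float]:
    """ITL: gaps between consecutive tokens after the first one."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]

def end_to_end_latency(request_start: float, token_times: List[float]) -> float:
    """Total time from sending the request to receiving the last token."""
    return token_times[-1] - request_start

if __name__ == "__main__":
    # Hypothetical timestamps in seconds for one streamed response.
    start = 0.0
    arrivals = [0.5, 0.75, 1.0, 1.25]
    print("TTFT:", time_to_first_token(start, arrivals))
    print("ITL: ", inter_token_latencies(arrivals))
    print("E2E: ", end_to_end_latency(start, arrivals))
```

A benchmark would aggregate these per-request values (for example as percentiles) and compare them against a service-level objective.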
Long-Context
Long-Context is an open-source repository from Abacus.AI designed to provide code and tooling for Large Language Model (LLM) context expansion. It offers a comprehensive suite of evaluation scripts and benchmark tasks specifically tailored to assess a model’s information retrieval capabilities within expanded contexts. The repository details various experimental results, including different positional encoding schemes like linear scaling and fine-tuning approaches, and provides instructions for reproducing and building upon these findings. It also shares weights for best-performing models, such as the scale 16 model, which is expected to perform well up to 16k context lengths. The project includes novel evaluation datasets like an extended LMSys dataset and WikiQA (Free Form QA and Altered Numeric QA) to rigorously test models across varying context lengths and answer locations, addressing potential issues like models answering from pre-trained knowledge rather than provided context.
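The linear scaling idea can be sketched as follows. This assumes rotary position embeddings, which the description above does not state, and it is not the repository's code: the function names, the dimension, and the base of 10000 are illustrative choices. Linear scaling divides each position index by a scale factor before the encoding is computed, so longer sequences map into the position range the model already knows.

```python
import math
from typing import List

def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles for one position, one angle per pair of dimensions."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]

def linearly_scaled_angles(position: int, dim: int, scale: float) -> List[float]:
    """Linear scaling: shrink the position index by `scale` before encoding."""
    return rope_angles(position / scale, dim)

if __name__ == "__main__":
    # With scale 16, position 16000 is encoded exactly like unscaled position 1000.
    scaled = linearly_scaled_angles(16000, dim=8, scale=16.0)
    reference = rope_angles(1000, dim=8)
    print(all(math.isclose(a, b) for a, b in zip(scaled, reference)))
```

Because scaled positions fall between the integer positions seen in pre-training, this scheme is usually paired with fine-tuning, which matches the fine-tuning approaches the repository reports on.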
AgenQA
AgenQA is an AI agent designed to automate the testing of web applications. It allows users to provide natural language instructions, which the AI then converts into fully automated tests for the entire web application, eliminating the need for manual coding. The tool features a simple visual interface, making it accessible for developers, QAs, product managers, and designers. AgenQA aims to find bugs that might be missed during manual testing and provides detailed usability reports. It also offers cloud synchronization for collaboration and automated runs, along with a CLI for integration into deployment pipelines.
SWE-agent
SWE-agent is an advanced agentic framework designed to enable language models (LMs) like GPT-4o or Claude Sonnet 4 to autonomously identify and fix issues within real GitHub repositories. Beyond software engineering tasks, it can be employed for offensive cybersecurity challenges, such as capture the flag, and competitive coding. The tool is highly configurable, governed by a single YAML file, and offers maximal agency to the LM, making it free-flowing and generalizable. Developed by researchers from Princeton University and Stanford University, SWE-agent has achieved state-of-the-art results on the SWE-bench benchmark. Users can try SWE-agent in their browser or explore its capabilities for offensive cybersecurity through its EnIGMA mode.
terminal-bench
terminal-bench is an open-source benchmark designed to evaluate the performance of AI agents, specifically Large Language Models (LLMs), in realistic terminal environments. It provides a comprehensive suite of tasks that challenge agents with complex, end-to-end scenarios, ranging from compiling code to training models and setting up servers. The tool consists of a dataset of tasks, each with an English instruction, a test script for verification, and a reference solution, along with an execution harness that connects the language model to a sandboxed terminal environment. This setup ensures reproducible and practical evaluation of system-level reasoning. It is currently in beta with approximately 100 tasks, with plans for significant expansion, and welcomes community contributions for new and challenging tasks.
TheAgentCompany
TheAgentCompany is an open-source benchmark designed to evaluate the performance of LLM agents on consequential, real-world tasks within a simulated software company environment. It allows for assessing how well AI agents can accelerate or autonomously perform work-related tasks by interacting with the web, writing code, running programs, and communicating. The platform offers diverse task roles, data types, and a comprehensive scoring system with multiple evaluation methods, including deterministic and LLM-based evaluators. It features simple one-command operations for environment setup and quick system resets, making it an extensible framework for adding new tasks and evaluators. The benchmark is available on GitHub and supports integration with platforms like OpenHands.
Thai Sentence Embedding Benchmark
Thai Sentence Embedding Benchmark is a specialized AI tool designed to evaluate and rank Thai sentence embedding models. It features a comprehensive leaderboard that showcases the performance of different models across a variety of datasets and tasks relevant to the Thai language. Users can access detailed scores for each model, enabling them to compare and select the most suitable embeddings for their specific natural language processing (NLP) applications. This tool is particularly valuable for AI researchers and NLP engineers who require robust benchmarks for developing and optimizing Thai language models.
VIBE Image Edit DEMO
VIBE Image Edit DEMO serves as a demonstration tool for the VIBE-Image-Edit model, hosted on Hugging Face Spaces. This application empowers users to interact with AI-driven image editing by either uploading an existing picture and describing desired modifications or by generating entirely new images from a text prompt. It provides a hands-on experience with the capabilities of the VIBE-Image-Edit model, allowing for creative exploration and practical application of AI in visual content creation. The tool is designed for ease of use, enabling individuals to experiment with advanced image manipulation techniques without requiring deep technical expertise.
Haize Labs
Haize Labs is a platform designed to help enterprises accelerate their AI initiatives, moving them from proof-of-concept (POC) to full production deployment. The platform emphasizes building and deploying highly reliable AI systems, with a stated target of 99.9% uptime. Haize Labs addresses the common challenge of operationalizing AI, ensuring that agentic systems are robust and perform as expected in real-world scenarios. This focus on reliability and production readiness is aimed at businesses looking to scale their AI investments.
AICGSecEval
AICGSecEval (A.S.E) is a comprehensive, repository-level benchmark for evaluating the security of AI-generated code, developed by the Tencent Wukong Code Security Team. It is designed to assess the security performance of AI-assisted programming by simulating real-world development workflows. The framework includes code generation tasks derived from real-world GitHub projects and CVE patches, ensuring practical relevance and security sensitivity. It automatically extracts project-level code context to simulate realistic AI programming scenarios and integrates a hybrid evaluation suite combining static and dynamic analysis for balanced detection coverage and verification precision. A.S.E aims to be an open, reproducible, and continuously evolving community project, welcoming contributions to expand its dataset and improve the evaluation framework.
ChatGDB
ChatGDB is a powerful tool designed to enhance the debugging experience within GDB or LLDB debuggers by integrating the capabilities of ChatGPT. It allows developers to interact with the debugger using natural language, explaining what they want to achieve, and then automatically executing the relevant commands. Users can also ask ChatGPT to explain previously run commands or pose general questions. This integration helps accelerate the debugging workflow by reducing the need to recall specific GDB/LLDB commands, allowing developers to focus on resolving bugs more efficiently. The tool supports both gpt-3.5-turbo and gpt-4 models and offers options for custom API URLs.
Test Labs
Test Labs is an AI-powered platform designed to simplify mobile app testing on real devices, specifically addressing Google Play's 20-device testing policy. It automates the entire testing process, eliminating the need for manual effort and allowing developers to focus on app development. The platform ensures compatibility, performance, and compliance across various real devices, including mid-range and high-end models. Users receive comprehensive daily reports, device logs, and testing screenshots, providing clear visibility into the testing process and results. Test Labs aims to accelerate Play Store approval, offering a cost-effective and secure solution for individual developers, startups, freelance developers, and larger tech companies.
IDCardRecognition
IDCardRecognition is an AI-powered tool hosted on Hugging Face that simplifies the process of extracting information from various identification documents. Users can upload front and back images of ID cards, passports, or driver licenses, and the application will automatically read and extract key details. This includes essential data such as the individual's name, the document number, and the expiry date, presented in a clear and organized format. This tool is ideal for automating data entry and verification processes, offering a streamlined solution for tasks requiring quick and accurate identification data extraction.
Open Portuguese LLM Leaderboard
The Open Portuguese LLM Leaderboard provides a comprehensive platform for tracking, ranking, and evaluating open Large Language Models (LLMs) specifically designed for the Portuguese language. Users can easily explore and filter models based on various criteria such as type, size, precision, and language. This tool is invaluable for researchers, developers, and AI enthusiasts who need to compare the performance of different LLMs in Portuguese. By offering detailed benchmarks, it helps identify top-performing models for specific Portuguese language tasks, facilitating informed decision-making in model selection and development. The platform aims to foster innovation and collaboration within the Portuguese AI community by providing transparent and accessible performance metrics.
Proov
Proov is an AI-powered platform designed to revolutionize sports betting by offering smarter in-play engagement. It leverages next-gen AI technology to engage and retain players in real-time through personalized and conversational interactions. The platform features an AI assistant that delivers tailored suggestions and relevant offers, triggered by live game events and personalized to each player's behavior and betting history. Players can ask questions, explore markets, discuss sports, follow smart push notifications, and place bets within the same flow. Proov.ai provides real-time statistical reasons for in-play betting decisions, combines performance data, customer segments, and betting history for personalized opportunities, and is configurable to offer prompts as desired. It integrates seamlessly with existing odds feeds and can generate dynamic live banners, empower affiliates, and create new revenue streams from social media channels.
Qwen3-Coder
Qwen3-Coder is a code-focused large language model developed by the Qwen team, designed to assist with a wide array of coding and agentic tasks. It is available in multiple sizes, including Qwen3-Coder-Next, and offers performance comparable to leading models like Claude Sonnet. Key features include efficiency-performance tradeoffs, scaling agentic coding across various platforms, and robust long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN. The model supports 358 coding languages and retains strong mathematical and general capabilities from its base model. It also supports fill-in-the-middle (FIM) for code insertion tasks and provides instruct models for chatting.
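Fill-in-the-middle prompting can be sketched as below. The special-token strings are an assumption based on the common prefix-suffix-middle layout and should be checked against the model's tokenizer and documentation. The helper names are hypothetical, and no model is called here.

```python
# Assumed special tokens; confirm against the model's tokenizer before use.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the code before and after the gap; the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Insert the generated middle back between the original prefix and suffix."""
    return prefix + middle + suffix

if __name__ == "__main__":
    before = "def add(a, b):\n    "
    after = "\n\nprint(add(1, 2))\n"
    print(build_fim_prompt(before, after))
    # A hypothetical completion, standing in for real model output.
    print(splice_completion(before, "return a + b", after))
```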
QualGent
QualGent is an autonomous AI test automation platform specifically designed for iOS and Android mobile applications. It leverages computer vision to eliminate flaky tests and ensure deterministic regression execution, recognizing UI elements precisely as a human would. The platform supports seamless integration into any CI/CD environment, allowing for automated regression suites on every PR or merge. QualGent generates comprehensive test plans from existing documentation like PRDs or Figma files, and its AI agent executes tests 24/7, covering more scenarios than manual testing. It supports multi-lingual testing, systems integration testing, and true end-to-end testing, including OTP, payments, and multi-device flows. QualGent also offers massive parallelization, allowing users to run entire test suites across thousands of AI agents instantly on emulators and real devices.
TesQuirel Solutions
TesQuirel Solutions offers AI-driven and no-code test automation platforms designed to enhance software quality engineering. The platform accelerates testing processes for various applications including web, mobile, APIs, and enterprise systems. By leveraging artificial intelligence, TesQuirel aims to streamline the creation and execution of test scenarios, test cases, and test data. This approach helps reduce the time required for functional QA, ensuring comprehensive test coverage and traceability. The solution is particularly beneficial for organizations looking to implement advanced automation strategies without extensive coding, thereby improving efficiency and accuracy in their software development lifecycle.
App Quality Copilot
Maestro is a modern end-to-end UI testing platform designed for mobile and web applications, making the testing process dead simple. It enables users to write their first test in under 5 minutes, offering both a free CLI and Maestro Studio Desktop for visual test creation, running, and debugging. The platform supports iOS, Android, and web applications, with a single framework compatible with various technologies like React Native, Flutter, Jetpack Compose, and SwiftUI. Maestro Studio provides visual testing by clicking on app UIs, autocompletion for dynamic code, and an element inspector. It also features MaestroGPT, an AI assistant trained on Maestro to generate commands and answer related questions. For scaling, Maestro offers enterprise-grade cloud infrastructure to run tests in parallel and integrate into CI/CD pipelines, helping teams catch issues early in the development lifecycle.
AIClient-2-API
AIClient-2-API is a powerful API proxy service designed to unify and simulate requests for various client-only large language models, including Gemini CLI, Antigravity, Codex, Grok, and Kiro. It encapsulates these into a local OpenAI-compatible interface, allowing any application to connect easily. Built on Node.js, it intelligently converts between OpenAI, Claude, and Gemini protocols, enabling tools like Cherry-Studio and NextChat to utilize advanced models such as Claude Opus 4.5 and Gemini 3.0 Pro at scale. The project features a modular architecture with account pool management, intelligent polling, automatic failover, and health checks, ensuring high service availability. It also offers a Web UI management console for real-time configuration and monitoring.
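The account pool with automatic failover can be illustrated with a minimal sketch. The project itself is built on Node.js, so this Python version is only a language-neutral illustration. The class, its method names, and the round-robin reading of "intelligent polling" are assumptions, not the project's actual design.

```python
from typing import Callable, Dict, List

class AccountPool:
    """Hypothetical round-robin pool that skips accounts marked unhealthy."""

    def __init__(self, accounts: List[str]) -> None:
        self.accounts = list(accounts)
        self.healthy: Dict[str, bool] = {name: True for name in accounts}
        self._next = 0

    def call(self, send: Callable[[str], str]) -> str:
        """Try each account at most once, failing over when a request errors."""
        for _ in range(len(self.accounts)):
            account = self.accounts[self._next]
            self._next = (self._next + 1) % len(self.accounts)
            if not self.healthy[account]:
                continue
            try:
                return send(account)
            except RuntimeError:
                # Mark the account as failed; a periodic health check could restore it.
                self.healthy[account] = False
        raise RuntimeError("no healthy accounts available")

if __name__ == "__main__":
    def fake_send(account: str) -> str:
        # Stand-in for an upstream request; account "a" is simulated as down.
        if account == "a":
            raise RuntimeError("upstream unavailable")
        return f"ok:{account}"

    pool = AccountPool(["a", "b", "c"])
    print(pool.call(fake_send))
    print(pool.call(fake_send))
    print(pool.call(fake_send))
```

Keeping health state per account lets one failing credential be taken out of rotation without interrupting callers, which is the availability property the description attributes to the proxy.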
Modl
Modl is an AI-driven platform designed to automate game testing and quality assurance, helping developers find bugs, glitches, and performance issues more rapidly. It leverages AI agents and analysts to provide comprehensive test coverage, allowing QA teams to operate independently without needing SDKs, code hooks, or engineering support. Users can instruct the AI in plain language for daily test cycles or complex flows, and the system handles routine test cases as well as open-ended exploratory tasks. Modl automatically generates detailed bug reports with descriptions, visuals, and severity scores for detected issues like visual glitches, missing assets, and gameplay logic bugs. The platform supports testing on Android and desktop, with iOS support in development, and is particularly effective for mobile games and titles with structured interactions.
Satyaki Solutions
Satyaki Solutions pioneers transformative AI and ML technologies, offering bespoke solutions across various industries. Their expertise includes avant-garde Computer Vision applications that redefine industry standards, rigorous testing software ensuring impeccable quality, and streamlined fintech operations with unparalleled precision and security. They also provide AI Agent Development using advanced tools like AutoGen Studio and Crew AI, SaaS development for scalable and secure platforms, and comprehensive digital branding services. Additionally, Satyaki offers full-stack development for web and mobile applications and Testing as a Service (TaaS) for comprehensive software quality assurance. They focus on creating use-case specific solutions tailored to market-leading customers.