QA Engineer - Testing LLM agents

BrickRed Systems
Kirkland, WA

Role Summary

BrickRed Systems is seeking an experienced QA Engineer specializing in testing LLM agents and AI-driven workflows. This role focuses on evaluating agentic behavior, safety, reliability, grounding, automation quality, and deterministic vs. non-deterministic outcomes across advanced AI pipelines. You will collaborate closely with engineering, product, and AI research teams.


Key Responsibilities

  • Design and execute comprehensive test strategies for LLM agents, agentic workflows, multi-step planners, and tool-using AI systems
  • Implement Eval-Loops for continuous, automated evaluation of model performance, drift, consistency, and safety
  • Build and maintain Golden Datasets to benchmark model accuracy, grounding, and regression behavior
  • Use hill‑climbing evaluation techniques to iteratively improve prompts, policies, and model outputs
  • Evaluate and test safety shield models (e.g., ShieldGemma) for content filtering, policy enforcement, and guardrail robustness
  • Perform adversarial testing against hallucinations, ungrounded responses, safety violations, and reasoning failures
  • Develop automation harnesses using Python, REST APIs, LangChain, PromptFlow, and LLM evaluation frameworks
  • Assess agent behaviors across variations in prompts, contexts, tools, and reasoning paths
  • Analyze responses for factuality, coherence, instruction-following, policy adherence, and chain-of-thought integrity (when applicable)
  • Document findings, build structured bug taxonomies, and partner with engineering teams to resolve issues
  • Drive improvements in reliability, latency, determinism, and consistent execution of multi-step agent behaviors
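Several of the responsibilities above (eval loops, golden datasets, custom scoring for grounding and regressions) fit a common pattern: score agent outputs against reference answers and gate on an accuracy threshold. The sketch below illustrates that pattern; all names (`run_agent`, `GOLDEN_SET`, `grounded`, the threshold) are hypothetical placeholders, not part of any specific framework.

```python
# Minimal golden-dataset eval-loop sketch. All identifiers here are
# illustrative; a real harness would replace run_agent with a call to
# the model under test and grounded with a proper scoring function.

GOLDEN_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def run_agent(prompt: str) -> str:
    """Stub for the agent under test; swap in a real model call."""
    canned = {
        "What is the capital of France?": "Paris",
        "2 + 2 = ?": "4",
    }
    return canned[prompt]

def grounded(answer: str, expected: str) -> bool:
    """Toy grounding check: the reference answer appears in the output."""
    return expected.lower() in answer.lower()

def eval_loop(threshold: float = 0.9) -> float:
    """Run every golden case, compute accuracy, fail on regression."""
    passed = sum(
        grounded(run_agent(case["prompt"]), case["expected"])
        for case in GOLDEN_SET
    )
    accuracy = passed / len(GOLDEN_SET)
    assert accuracy >= threshold, f"regression: accuracy {accuracy:.2f}"
    return accuracy
```

In practice the exact-match check would be replaced with semantic or rubric-based scoring, and the loop would run continuously in CI to catch drift.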


Required Technical Skills

  • Strong QA experience (manual + automation) with AI/ML, LLMs, or agentic systems
  • Hands-on experience with Python, automation frameworks, evaluation scripts, REST/JSON APIs
  • Familiarity with LLM platforms (Azure OpenAI, OpenAI, Anthropic, Google Gemini, etc.)
  • Experience with evaluation frameworks such as:
      • PromptFlow evaluations
      • DeepEval / Ragas / TruLens
      • LangChain LCEL evaluations
      • Custom scoring functions for grounding, correctness, toxicity, etc.
  • Experience using or testing safety-shield models (e.g., ShieldGemma or similar)
  • Understanding of techniques such as:
      • Hill-climbing optimization
      • Agent-loop testing
      • Determinism scoring
      • Self-reflection / self-correction evaluation
      • Guardrail stress testing
      • Scenario-based reasoning tests
  • Strong analytical and problem‑solving skills for non-deterministic system behavior
  • Excellent documentation, communication, and cross-team collaboration skills


About BrickRed Systems:

BrickRed Systems is a global leader in next-generation technology, consulting, and business process services. We enable clients to navigate their digital transformation, delivering a range of consulting services across multiple industries around the world. Our practices employ highly skilled, experienced professionals with a client-centric passion for innovation and delivery excellence.

With ISO 27001 and ISO 9001 certifications and over a decade of experience managing the systems and operations of global enterprises, we harness the power of cognitive computing, hyper-automation, robotics, cloud, analytics, and emerging technologies to help our clients adapt to the digital world and succeed in it. Our always-on learning agenda drives our clients' continuous improvement by building and transferring digital skills, expertise, and ideas from our innovation ecosystem.