QA Engineer - Testing LLM agents

BrickRed Systems
Kirkland, WA

Role Summary

BrickRed Systems is seeking an experienced QA Engineer specializing in testing LLM agents and AI-driven workflows. This role focuses on evaluating agentic behavior, safety, reliability, grounding, automation quality, and deterministic vs. non-deterministic outcomes across advanced AI pipelines. You will collaborate closely with engineering, product, and AI research teams.


Key Responsibilities

  • Design and execute comprehensive test strategies for LLM agents, agentic workflows, multi-step planners, and tool-using AI systems
  • Implement Eval-Loops for continuous, automated evaluation of model performance, drift, consistency, and safety
  • Build and maintain Golden Datasets to benchmark model accuracy, grounding, and regression behavior
  • Use hill‑climbing evaluation techniques to iteratively improve prompts, policies, and model outputs
  • Evaluate and test safety shield models (e.g., ShieldGemma) for content filtering, policy enforcement, and guardrail robustness
  • Perform adversarial testing against hallucinations, ungrounded responses, safety violations, and reasoning failures
  • Develop automation harnesses using Python, REST APIs, LangChain, PromptFlow, and LLM evaluation frameworks
  • Assess agent behaviors across variations in prompts, contexts, tools, and reasoning paths
  • Analyze responses for factuality, coherence, instruction-following, policy adherence, and chain-of-thought integrity (when applicable)
  • Document findings, build structured bug taxonomies, and partner with engineering teams to resolve issues
  • Drive improvements in reliability, latency, determinism, and consistent execution of multi-step agent behaviors
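Several of the responsibilities above (eval loops, golden datasets, custom scoring for grounding and regressions) fit a common pattern: score agent outputs against reference answers and gate on an accuracy threshold. The sketch below illustrates that pattern; all names (`run_agent`, `GOLDEN_SET`, `grounded`, the threshold) are hypothetical placeholders, not part of any specific framework.

```python
# Minimal golden-dataset eval-loop sketch. All identifiers here are
# illustrative; a real harness would replace run_agent with a call to
# the model under test and grounded with a proper scoring function.

GOLDEN_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def run_agent(prompt: str) -> str:
    """Stub for the agent under test; swap in a real model call."""
    canned = {
        "What is the capital of France?": "Paris",
        "2 + 2 = ?": "4",
    }
    return canned[prompt]

def grounded(answer: str, expected: str) -> bool:
    """Toy grounding check: the reference answer appears in the output."""
    return expected.lower() in answer.lower()

def eval_loop(threshold: float = 0.9) -> float:
    """Run every golden case, compute accuracy, fail on regression."""
    passed = sum(
        grounded(run_agent(case["prompt"]), case["expected"])
        for case in GOLDEN_SET
    )
    accuracy = passed / len(GOLDEN_SET)
    assert accuracy >= threshold, f"regression: accuracy {accuracy:.2f}"
    return accuracy
```

In practice the exact-match check would be replaced with semantic or rubric-based scoring, and the loop would run continuously in CI to catch drift.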


Required Technical Skills

  • Strong QA experience (manual + automation) with AI/ML, LLMs, or agentic systems
  • Hands-on experience with Python, automation frameworks, evaluation scripts, REST/JSON APIs
  • Familiarity with LLM platforms (Azure OpenAI, OpenAI, Anthropic, Google Gemini, etc.)
  • Experience with evaluation frameworks such as:
      • PromptFlow evaluations
      • DeepEval / Ragas / TruLens
      • LangChain LCEL evaluations
      • Custom scoring functions for grounding, correctness, toxicity, etc.
  • Experience using or testing safety-shield models (e.g., ShieldGemma or similar)
  • Understanding of techniques such as:
      • Hill-climbing optimization
      • Agent-loop testing
      • Determinism scoring
      • Self-reflection / self-correction evaluation
      • Guardrail stress testing
      • Scenario-based reasoning tests
  • Strong analytical and problem‑solving skills for non-deterministic system behavior
  • Excellent documentation, communication, and cross-team collaboration skills


About BrickRed Systems:

BrickRed Systems is a global leader in next-generation technology, consulting, and business process services. We enable clients to navigate their digital transformation, delivering a range of consulting services across multiple industries around the world. Our practices employ highly skilled, experienced professionals with a client-centric passion for innovation and delivery excellence.

With ISO 27001 and ISO 9001 certifications and over a decade of experience managing the systems and operations of global enterprises, we harness the power of cognitive computing, hyper-automation, robotics, cloud, analytics, and emerging technologies to help our clients adapt to the digital world and succeed in it. Our always-on learning agenda drives our clients' continuous improvement by building and transferring digital skills, expertise, and ideas from our innovation ecosystem.