Role Summary
Brickred Systems is seeking an experienced QA Engineer specializing in testing LLM agents and AI-driven workflows. This role focuses on evaluating agentic behavior, safety, reliability, grounding, automation quality, and deterministic versus non-deterministic outcomes across advanced AI pipelines. You will collaborate closely with engineering, product, and AI research teams.
Key Responsibilities
- Design and execute comprehensive test strategies for LLM agents, agentic workflows, multi-step planners, and tool-using AI systems
- Implement evaluation loops for continuous, automated assessment of model performance, drift, consistency, and safety
- Build and maintain golden datasets to benchmark model accuracy, grounding, and regression behavior
- Use hill-climbing evaluation techniques to iteratively improve prompts, policies, and model outputs
- Evaluate and test safety shield models (e.g., ShieldGemma) for content filtering, policy enforcement, and guardrail robustness
- Perform adversarial testing against hallucinations, ungrounded responses, safety violations, and reasoning failures
- Develop automation harnesses using Python, REST APIs, LangChain, PromptFlow, and LLM evaluation frameworks
- Assess agent behaviors across variations in prompts, contexts, tools, and reasoning paths
- Analyze responses for factuality, coherence, instruction-following, policy adherence, and chain-of-thought integrity (when applicable)
- Document findings, build structured bug taxonomies, and partner with engineering teams to resolve issues
- Drive improvements in reliability, latency, determinism, and consistent execution of multi-step agent behaviors
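As an illustration of the eval-loop and golden-dataset responsibilities above, here is a minimal sketch of scoring agent responses against a golden dataset. All names (`GoldenCase`, `fake_agent`, `grounding_score`) and the stubbed agent are hypothetical stand-ins for a real harness, not a prescribed implementation:

```python
# Minimal golden-dataset eval-loop sketch (illustrative only).
# A real harness would call an LLM agent; here a canned stub stands in.
from dataclasses import dataclass


@dataclass
class GoldenCase:
    prompt: str
    expected_keywords: list  # facts the response must mention


def fake_agent(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM agent call
    canned = {
        "capital of France?": "The capital of France is Paris.",
        "2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(prompt, "I don't know.")


def grounding_score(response: str, case: GoldenCase) -> float:
    # Fraction of expected facts actually present in the response
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in response.lower())
    return hits / len(case.expected_keywords)


def run_eval(cases):
    # Score every golden case and compute the overall pass rate
    scores = {c.prompt: grounding_score(fake_agent(c.prompt), c) for c in cases}
    passed = sum(1 for s in scores.values() if s >= 1.0)
    return scores, passed / len(cases)


golden = [
    GoldenCase("capital of France?", ["Paris"]),
    GoldenCase("2 + 2?", ["4"]),
]
scores, pass_rate = run_eval(golden)
```

In practice the stubbed agent would be replaced by a model API call, and the keyword check by richer scorers (grounding, toxicity, regression deltas) run on every build.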
Required Technical Skills
- Strong QA experience (manual + automation) with AI/ML, LLMs, or agentic systems
- Hands-on experience with Python, automation frameworks, evaluation scripts, REST/JSON APIs
- Familiarity with LLM platforms (Azure OpenAI, OpenAI, Anthropic, Google Gemini, etc.)
- Experience with evaluation frameworks such as:
  - PromptFlow evaluations
  - DeepEval / Ragas / TruLens
  - LangChain LCEL evaluations
  - Custom scoring functions for grounding, correctness, toxicity, etc.
- Experience using or testing safety-shield models (e.g., ShieldGemma or similar)
- Understanding of techniques such as:
  - Hill-climbing optimization
  - Agent loop testing
  - Determinism scoring
  - Self-reflection / self-correction evaluation
  - Guardrail stress testing
  - Scenario-based reasoning tests
- Strong analytical and problem-solving skills for non-deterministic system behavior
- Excellent documentation, communication, and cross-team collaboration skills
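Determinism scoring, listed among the techniques above, can be sketched as running the same prompt repeatedly and measuring how often the modal response recurs. The stub below simulates non-determinism with a seed; in a real harness the repeated calls would go to the model API (all names here are hypothetical):

```python
# Illustrative determinism-scoring sketch (not a prescribed implementation).
from collections import Counter


def stub_agent(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a model call; the seed simulates
    # output variation across otherwise identical requests.
    return "stable answer" if seed % 4 != 0 else "variant answer"


def determinism_score(prompt: str, runs: int = 8) -> float:
    # Fraction of runs that produced the modal (most common) output;
    # 1.0 means fully deterministic for this prompt.
    outputs = [stub_agent(prompt, seed) for seed in range(runs)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / runs


score = determinism_score("Summarize policy X", runs=8)
```

Exact-match counting is the simplest variant; a production scorer might instead cluster semantically equivalent responses before computing the modal fraction.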
About Brickred Systems
Brickred Systems is a global leader in next-generation technology, consulting, and business process services. We enable clients to navigate their digital transformation, delivering a range of consulting services across multiple industries around the world. Our practices employ highly skilled and experienced individuals with a client-centric passion for innovation and delivery excellence.
With ISO 27001 and ISO 9001 certification and over a decade of experience managing the systems and workings of global enterprises, we harness the power of cognitive computing, hyper-automation, robotics, cloud, analytics, and emerging technologies to help our clients adapt to the digital world and succeed in it. Our always-on learning agenda drives our clients' continuous improvement by building and transferring digital skills, expertise, and ideas from our innovation ecosystem.