Senior Developer — AI Evaluation & Cloud Infrastructure

Just Horizons Alliance
Boston, MA

Join us to build the technical foundation for AI accountability.


The Role

Just Horizons Alliance is an 18-year-old applied research lab working at the intersection of ethics and technology. Our current focus is the AI Ethics Index, a measurement framework for evaluating AI systems on ethics, safety, and societal impact.


We need a senior engineer to own the technical infrastructure end-to-end: learn what exists, close critical gaps, and build something that lasts.


The evaluation methodology is validated and in use. We're now at the stage where the systems need to mature alongside the research. This is the first dedicated infrastructure hire for this work, and you'll shape how it scales.


What You’ll Do

Months 1–3: Learn the System

Map the current architecture with Sophia Zitman (AIEI Team Lead). Understand the evaluation methodology, the data flows, and the infrastructure that supports them. Identify what needs to evolve for multi-domain benchmarking—reproducibility, security posture, test coverage, deployment pipeline. Begin implementing the highest-priority improvements.

Months 4–6: Build for Scale

Architect the infrastructure to support the next phase of the Index. CI/CD that maintains stability as the system grows. IAM and secret management built for a production environment. Experiment tracking that makes every evaluation run auditable. Documentation that enables the research team to work independently.
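
To make "auditable" concrete, here is a minimal sketch of the kind of run record we have in mind. Every field name is an illustrative assumption, not the Index's actual schema; the point is that an evaluation run can only be replayed and audited later if its inputs, configuration, and code version are captured at execution time.

```python
# Illustrative sketch only: field names are assumptions, not the real AIEI schema.
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def config_fingerprint(config: dict) -> str:
    """Stable hash of an evaluation config; key order must not change the hash."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EvaluationRun:
    run_id: str               # unique identifier for this run
    model_id: str             # exact model version evaluated, never a floating alias
    methodology_version: str  # version of the evaluation methodology applied
    config_hash: str          # fingerprint of the full config, for drift detection
    code_git_sha: str         # commit of the evaluation engine that produced the result
    started_at: str           # UTC timestamp, ISO 8601


run = EvaluationRun(
    run_id="run-0001",
    model_id="vendor/model-2024-06",  # hypothetical identifier
    methodology_version="aiei-v2",    # hypothetical version tag
    config_hash=config_fingerprint({"domain": "education", "temperature": 0}),
    code_git_sha="abc1234",
    started_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(run), indent=2))  # persisted alongside the run's outputs
```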

Months 7–12: Expand

Multi-domain benchmarking across education, healthcare, finance, and other sectors. Reproducibility standards that meet external scientific scrutiny. A system the research team can extend without engineering support for every change. At this point, the infrastructure should be stable enough that you're focused on capability, not maintenance.


Why This Role Is Difficult

This is infrastructure for a scientific standard, not a product feature.

Correctness and delivery both matter. A bug in the evaluation engine doesn't break a feature; it invalidates a measurement. A flawed pipeline doesn't just slow things down; it compromises the credibility of the research. At the same time, methodology that never runs in production has no impact. The role requires both rigor and momentum.

You're translating between disciplines. Your stakeholders are researchers, ethicists, and governance specialists. You'll need to take concepts like "operationalizing an ethical construct" and turn them into data models and pipelines. This is a translation problem as much as an engineering problem.

The work is novel. There's no existing system to reference. The AI Ethics Index is defining what rigorous AI evaluation looks like. You'll be making architectural decisions in areas where best practices don't yet exist.

You'll have full ownership. This is not a role where you're executing someone else's technical vision. You're setting the direction. That means autonomy, but it also means accountability.


You're probably the right person if

✅ You've built evaluation systems or data pipelines that other people depended on for correctness, not just uptime

✅ You're comfortable with GCP and have deployed containerized workloads in a real production context

✅ You've worked with LLM APIs and understand their reliability and reproducibility characteristics

✅ You can read a paper about measurement methodology and turn it into a working data structure

✅ You build for durability. Your code is still running 18 months later because you thought about the next person

✅ You've worked somewhere with between 5 and 50 people, and you're comfortable being the person who figures things out without a playbook

✅ You find working on AI ethics infrastructure more interesting than building another e-commerce checkout flow


You're probably not the right fit if

❌ Enterprise environments make up most of your experience. This is not a large-team context

❌ You need clearly defined requirements before you can start. The requirements here evolve through conversation with ethicists

❌ You're based on the West Coast US or expect West Coast US working hours

❌ You mainly build user-facing APIs and features — this is systems and infrastructure work

❌ You're looking for a high-growth startup where shipping speed is everything. This is a scientific organization. Correctness is everything.


Hard Skills

These are the technical capabilities you need going in — or need to be able to build up fast using an AI coding agent. We're not looking for someone who ticks every box. We're looking for someone who closes gaps quickly and knows how to learn.

  • Python — strong enough to design systems architecture and reason about failure modes, even if you work with AI assistance for implementation details
  • Google Cloud Platform — specifically Cloud Run, IAM design, secret management, and containerized workload deployment in a real production context
  • API and model documentation — able to read, write, and navigate API specs and model documentation fluently; you know how to figure out how a system behaves from its documentation without needing someone to walk you through it
  • Structured step-by-step reasoning — when you hit a complex problem, you decompose it immediately and visibly into logical steps; you don't disappear into your head and come back with an answer; you think out loud and in sequence, which makes collaboration with the ethics and research team possible
  • LLM API integration — understanding the reliability, reproducibility, and failure characteristics of external model endpoints (see the first sketch after this list)
  • Data pipeline architecture — building evaluation or measurement systems where correctness is non-negotiable, not just plumbing that moves data around
  • Experiment tracking and reproducibility standards — always looking to improve the evaluation pipeline; you understand what needs to be tracked and why, and you find the right approach for the context without being dogmatic about tooling
  • Statistical reliability concepts — enough to understand what inter-rater reliability means for evaluation output and why reproducibility matters scientifically (see the second sketch after this list)
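
First, the LLM integration point. This is a minimal sketch, with a hypothetical `call_model` standing in for any vendor endpoint; nothing here is our actual stack. The pattern is what matters: pin exact model versions, use deterministic settings where the API allows them, retry transient failures with backoff, and hash each request so the call is auditable afterwards.

```python
# Illustrative sketch: `call_model` is a hypothetical stand-in for a vendor API.
import hashlib
import json
import random
import time

random.seed(0)  # deterministic demo, in the spirit of the role

PINNED_PARAMS = {
    "model": "vendor/model-2024-06",  # pin an exact version, never "latest"
    "temperature": 0,                 # deterministic where the API allows it
}


def call_model(prompt: str, **params) -> str:
    """Hypothetical vendor call: real endpoints fail transiently, and some
    return different tokens across runs even at temperature 0."""
    if random.random() < 0.3:
        raise TimeoutError("simulated transient endpoint failure")
    return f"scored response for: {prompt[:40]}"


def evaluate(prompt: str, max_retries: int = 4) -> dict:
    request = {"prompt": prompt, **PINNED_PARAMS}
    request_hash = hashlib.sha256(
        json.dumps(request, sort_keys=True).encode("utf-8")
    ).hexdigest()
    for attempt in range(max_retries):
        try:
            output = call_model(prompt, **PINNED_PARAMS)
            # Record enough to audit and re-run this exact request later.
            return {"request_hash": request_hash, "attempt": attempt + 1, "output": output}
        except TimeoutError:
            time.sleep(0.5 * 2 ** attempt)  # exponential backoff on transient failures
    raise RuntimeError(f"endpoint failed after {max_retries} attempts ({request_hash})")


print(evaluate("Does the system explain its refusal criteria?"))
```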
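
Second, the statistical reliability point. A minimal sketch of Cohen's kappa, the standard measure of inter-rater agreement corrected for chance; the rater scores below are made up.

```python
# Illustrative sketch of Cohen's kappa; the ratings are invented for the demo.
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    if expected == 1:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Two hypothetical raters scoring the same ten model outputs on a 0-2 scale.
a = [2, 1, 0, 2, 2, 1, 0, 1, 2, 0]
b = [2, 1, 0, 2, 1, 1, 0, 1, 2, 1]
print(round(cohens_kappa(a, b), 3))  # ~0.701; 1.0 is perfect, 0 is chance-level
```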


What you get

The role: You'll work directly with Sophia Zitman (AIEI Team Lead) as the technical backbone of the AI Ethics Index. Full engineering ownership of the evaluation engine.

The comp: Base salary $110,000.

The team: Small, split between ethicists and engineers. You will interview with Janet Kang (Executive Director) and Sophia Zitman (AIEI Team Lead).

The environment: Boston-based non-profit (501(c)(3)). East Coast US or Western Europe time zones. Collaborative but autonomous — Sophia won't micromanage, but she will hold you to a high standard of systems thinking.

The upside: You'll have built the technical foundation of what may become the globally referenced standard for AI system evaluation. That's a meaningful line on any CV — and a genuinely hard thing to have done.