Our AI/ML engineering team ensures Code and Theory delivers innovative, immersive web experiences that delight our clients and their customers. We constantly strive to balance the challenges of working with cutting-edge technologies against the real-world demands of high performance, strong security, and accessibility.
Responsibilities
- Write Python and SQL scripts to evaluate outputs from large language models (LLMs)
- Design and implement LLM-as-Judge evaluations with clear scoring rubrics (a sketch follows this list)
- Define and calculate quality metrics such as exact match, token-level F1, ROUGE, and subjective rubric scores (see the metric sketch below)
- Build and maintain ground-truth datasets for benchmarking and regression testing
- Automate evaluation pipelines and integrate them into CI/CD workflows (see the regression-test sketch below)
- Conduct in-depth analysis of large unstructured datasets to identify inconsistencies, anomalies, missing values, and potential biases
- Diagnose and report failure modes (hallucinations, irrelevant answers, formatting errors)
- Serve as a crucial link between AI engineers, QA, data scientists, and product managers, collaborating to set quality standards and release criteria
- Document processes and maintain reproducibility of evaluation runs
- Create comprehensive technical documentation, including design specifications, architecture diagrams, and code comments
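To illustrate the LLM-as-Judge work, here is a minimal sketch of rubric-based judging. The rubric text, criteria, 1-5 scale, and JSON schema are illustrative assumptions, and `call_llm` is a placeholder for whatever model client the team actually uses, not a specific API.

```python
import json

# Illustrative rubric; the criteria, scale, and JSON schema are assumptions, not a house standard.
RUBRIC = (
    "You are grading an answer for factual accuracy and relevance.\n"
    "Score each criterion from 1 to 5 and reply with JSON only:\n"
    '{"accuracy": <1-5>, "relevance": <1-5>, "rationale": "<one sentence>"}'
)

def judge(question: str, answer: str, call_llm) -> dict:
    """Score an answer with a judge model against the rubric.

    `call_llm` is a placeholder for whatever client function returns the judge
    model's text completion for a prompt string.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer to grade: {answer}"
    raw = call_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Malformed judge output is itself a failure mode worth diagnosing and reporting.
        return {"accuracy": None, "relevance": None, "rationale": "unparseable judge output"}
```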
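For the objective metrics named above, a minimal sketch of how exact match and token-level F1 are commonly computed in plain Python, not tied to any particular evaluation framework:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between whitespace-tokenized prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Overlap counts each shared token at most min(pred_count, ref_count) times.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Example: score a couple of model outputs against references.
    cases = [
        {"prediction": "Paris is the capital of France", "reference": "Paris"},
        {"prediction": "Paris", "reference": "Paris"},
    ]
    for case in cases:
        print(exact_match(case["prediction"], case["reference"]),
              round(token_f1(case["prediction"], case["reference"]), 3))
```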
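And for ground-truth benchmarking wired into CI/CD, a minimal sketch of a pytest-style regression test that a pipeline could run on every change. The `eval_metrics` module (assumed to hold the helpers from the previous sketch), the JSONL path, and the 0.80 threshold are all hypothetical.

```python
import json

from eval_metrics import exact_match, token_f1  # hypothetical module holding the metric helpers

# Hypothetical ground-truth file: one JSON object per line pairing a stored model output with its reference.
GROUND_TRUTH_PATH = "data/ground_truth.jsonl"
F1_THRESHOLD = 0.80  # illustrative release criterion

def load_cases(path):
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def test_token_f1_does_not_regress():
    cases = load_cases(GROUND_TRUTH_PATH)
    scores = [token_f1(c["prediction"], c["reference"]) for c in cases]
    mean_f1 = sum(scores) / len(scores)
    assert mean_f1 >= F1_THRESHOLD, f"mean token F1 {mean_f1:.3f} fell below {F1_THRESHOLD}"
```

Run under `pytest` in the CI workflow, a failing assertion blocks the release, which is one way the quality standards and release criteria above can be enforced automatically.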