Evaluation Scenario Writer (m/w/d)
Project info
- Daily rate: 290 - 640 €
- Language: English (Advanced)
- Remote: 100%
Description
We’re looking for someone who can design realistic and structured evaluation scenarios for LLM-based agents. You’ll create test cases that simulate human-performed tasks and define gold-standard behavior to compare agent actions against. You’ll work to ensure each scenario is clearly defined, well-scored, and easy to execute and reuse. You’ll need a sharp analytical mindset, attention to detail, and an interest in how AI agents make decisions.
Although every project is unique, you might typically:
- Design structured test scenarios based on real-world tasks.
- Define the golden path and acceptable agent behavior.
- Annotate task steps, expected outputs, and edge cases.
- Work with developers to test your scenarios and improve clarity.
- Review agent outputs and adapt tests accordingly.
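To make the above concrete, here is a minimal sketch of what a structured scenario with a gold path and simple scoring logic could look like. All field names, task content, and the scoring rule are hypothetical illustrations, not a prescribed format:

```python
# Hypothetical evaluation scenario: a task, its gold-standard action
# sequence, and edge cases to annotate. The structure is illustrative only.
scenario = {
    "id": "book-flight-001",
    "task": "Book the cheapest direct flight from Berlin to Paris.",
    "gold_path": [
        "search_flights(origin='BER', destination='CDG', direct=True)",
        "sort_results(by='price')",
        "book_flight(result_index=0)",
    ],
    "edge_cases": ["no direct flights available", "ambiguous date"],
}

def score_trace(agent_trace: list[str], gold_path: list[str]) -> float:
    """Fraction of gold-path steps the agent performed, in order."""
    matched = 0
    for step in agent_trace:
        if matched < len(gold_path) and step == gold_path[matched]:
            matched += 1
    return matched / len(gold_path)

# An agent trace that follows the gold path exactly scores 1.0.
trace = [
    "search_flights(origin='BER', destination='CDG', direct=True)",
    "sort_results(by='price')",
    "book_flight(result_index=0)",
]
print(score_trace(trace, scenario["gold_path"]))  # 1.0
```

In practice such a scenario would typically live in a JSON or YAML file rather than inline Python, and the matching rule (exact string equality here) would be chosen per scenario.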
Requirements
- Bachelor's and/or Master's Degree in Computer Science, Software Engineering, Data Science / Data Analytics, Artificial Intelligence / Machine Learning, Computational Linguistics / Natural Language Processing (NLP), Information Systems, or other related fields.
- Background in QA, software testing, data analysis, or NLP annotation.
- Good understanding of test design principles (e.g., reproducibility, coverage, edge cases).
- Strong written communication skills in English.
- Comfortable with structured formats like JSON/YAML for scenario description.
- Can define expected agent behaviors (gold paths) and scoring logic.
- Basic experience with Python and JavaScript.
- Curious and open to working with AI-generated content, agent logs, and prompt-based behavior.
- Ready to learn new methods, able to switch between tasks and topics quickly, and comfortable working at times with challenging, complex guidelines.
- Our freelance role is fully remote, so you just need a laptop, an internet connection, available time, and enthusiasm to take on a challenge.
Nice to Have
- Experience in writing manual or automated test cases.
- Familiarity with LLM capabilities and typical failure modes.
- Understanding of scoring metrics (precision, recall, coverage, reward functions).
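For the metrics mentioned above, a rough sketch of how precision and recall might be computed over an agent's actions against a gold action set (a simplified set-based view; real scenarios may require ordered or fuzzy matching, and the action names are hypothetical):

```python
# Sketch: precision and recall of an agent's actions versus a gold set.
# Actions are treated as an unordered set for simplicity.

def precision_recall(agent_actions: set[str],
                     gold_actions: set[str]) -> tuple[float, float]:
    hits = agent_actions & gold_actions          # correctly performed actions
    precision = len(hits) / len(agent_actions) if agent_actions else 0.0
    recall = len(hits) / len(gold_actions) if gold_actions else 0.0
    return precision, recall

# Agent performed one extra, unneeded action ("email"):
p, r = precision_recall({"search", "sort", "book", "email"},
                        {"search", "sort", "book"})
print(p, r)  # 0.75 1.0
```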