We’re looking for someone who can design realistic and structured evaluation scenarios for LLM-based agents. You’ll create test cases that simulate human-performed tasks and define gold-standard behavior to compare agent actions against. You’ll work to ensure each scenario is clearly defined, well-scored, and easy to execute and reuse. You’ll need a sharp analytical mindset, attention to detail, and an interest in how AI agents make decisions.
Although every project is unique, your typical tasks might include:
- Designing structured test scenarios based on real-world tasks.
- Defining the golden path and acceptable agent behavior.
- Annotating task steps, expected outputs, and edge cases.
- Working with devs to test your scenarios and improve clarity.
- Reviewing agent outputs and adapting tests accordingly.
Requirements
- Bachelor’s and/or Master’s degree in Computer Science, Software Engineering, Data Science / Data Analytics, Artificial Intelligence / Machine Learning, Computational Linguistics / Natural Language Processing (NLP), Information Systems, or a related field.
- Background in QA, software testing, data analysis, or NLP annotation.
- Good understanding of test design principles (e.g., reproducibility, coverage, edge cases).
- Strong written communication skills in English.
- Comfortable with structured formats like JSON/YAML for scenario description.
- Ability to define expected agent behaviors (gold paths) and scoring logic.
- Basic experience with Python and JS.
- Curious and open to working with AI-generated content, agent logs, and prompt-based behavior.
- You are ready to learn new methods, can switch between tasks and topics quickly, and are able to work with challenging, complex guidelines when needed.
- Our freelance role is fully remote, so you just need a laptop, an internet connection, available time, and the enthusiasm to take on a challenge.
Nice to Have
- Experience in writing manual or automated test cases.
- Familiarity with LLM capabilities and typical failure modes.
- Understanding of scoring metrics (precision, recall, coverage, reward functions).
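To give a flavor of the work, here is a minimal sketch of a scenario described in JSON and scored against its gold path with precision and recall. The scenario ID, field names, step names, and the `score_run` helper are all hypothetical and chosen only for illustration; they are not a format used by the project.

```python
import json

# Hypothetical scenario definition; every field name here is an assumption.
SCENARIO_JSON = """
{
  "id": "refund-request-001",
  "task": "Process a customer refund request",
  "gold_path": ["open_ticket", "verify_order", "issue_refund", "notify_customer"]
}
"""


def score_run(gold_path, agent_actions):
    """Compare agent actions with the gold path as unordered sets of step names."""
    gold = set(gold_path)
    taken = set(agent_actions)
    matched = gold & taken
    return {
        # Share of the agent's actions that belong to the gold path.
        "precision": len(matched) / len(taken) if taken else 0.0,
        # Share of the gold path that the agent actually covered.
        "recall": len(matched) / len(gold) if gold else 0.0,
        "missing": sorted(gold - taken),
        "extra": sorted(taken - gold),
    }


scenario = json.loads(SCENARIO_JSON)

# Illustrative agent log: one off-path step, one gold step never performed.
agent_actions = ["open_ticket", "verify_order", "search_faq", "issue_refund"]

result = score_run(scenario["gold_path"], agent_actions)
print(result)
```

This sketch ignores the order of steps, so an agent that performs the right actions in the wrong sequence would still score perfectly. A real scoring scheme would likely also need to check ordering and the content of each output.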