Evaluation Scenario Writer (m/f/d)

Daily rate 288 - 640€

Remote 100%

Languages

English (Advanced)

Industries

Information Technology

Areas

Quality Assurance

Description

We’re looking for someone who can design realistic and structured evaluation scenarios for LLM-based agents. You’ll create test cases that simulate human-performed tasks and define gold-standard behavior to compare agent actions against. You’ll work to ensure each scenario is clearly defined, well-scored, and easy to execute and reuse. You’ll need a sharp analytical mindset, attention to detail, and an interest in how AI agents make decisions.

Although every project is unique, you might typically:

Design structured test scenarios based on real-world tasks.
Define the golden path and acceptable agent behavior.
Annotate task steps, expected outputs, and edge cases.
Work with developers to test your scenarios and improve clarity.
Review agent outputs and adapt tests accordingly.

Requirements

Bachelor's and/or Master’s Degree in Computer Science, Software Engineering, Data Science / Data Analytics, Artificial Intelligence / Machine Learning, Computational Linguistics / Natural Language Processing (NLP), Information Systems or other related fields.
Background in QA, software testing, data analysis, or NLP annotation.
Good understanding of test design principles (e.g., reproducibility, coverage, edge cases).
Strong written communication skills in English.
Comfortable with structured formats like JSON/YAML for scenario description.
Ability to define expected agent behaviors (golden paths) and scoring logic.
Basic experience with Python and JS.
Curious and open to working with AI-generated content, agent logs, and prompt-based behavior.
Ready to learn new methods, able to switch between tasks and topics quickly, and willing to work with challenging, complex guidelines at times.
Our freelance role is fully remote, so you just need a laptop, an internet connection, available time, and enthusiasm to take on a challenge.

Nice to Have

Experience in writing manual or automated test cases.
Familiarity with LLM capabilities and typical failure modes.
Understanding of scoring metrics (precision, recall, coverage, reward functions).

Frequently asked questions

The project is fully remote, providing complete location flexibility.

The project offers a daily rate of 288 - 640€, which breaks down to an hourly rate of 36 - 80€.

The project requires the following languages: English (Advanced).

The project is related to the following industry: Information Technology.

The project covers the following business area: Quality Assurance.

You can recommend a freelancer for the project and earn 30% of FRATCH's profits every time they get placed, for the duration of that project. Simply share your invite link with a colleague to get started.

To apply for the project, click the Apply button on the project page to submit your profile for review. We will forward your resume to the client and get back to you within a few days.
