Scoring Quality Assurance
Some info on our scoring team and how we manage quality:
What are the backgrounds of the scorers? How are they selected? Trained? Managed?
Woven scorers all have backgrounds in computer science, software development, or software engineering. They are hired via a work simulation where they score a set of dummy assessments. Roughly 1 in 5 candidates pass this bar.
Next, there is a multi-week onboarding process. During onboarding, scorers work through training modules. Then they score live assessments as a redundant 3rd scorer whose scores are not used. For the scenarios they scored with a high enough reliability rating (currently a 6% or lower error rate for initial certification, and 5% or lower from month 2 onward), they become certified to serve as one of the 2 double-blind scorers in production.
In normal day-to-day work, individual scorers receive a report on every mistake they make, along with a note from the QA team on how to improve.
Every month, Woven management reviews each scorer's error rate. If a scorer's error rate goes above our threshold, they receive coaching; if the error rate still doesn't improve, that scorer is terminated.
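As a rough sketch of how those thresholds fit together (the function and variable names below are illustrative, not our internal tooling):

```python
# A minimal sketch of the threshold rules described above; the names and data
# shapes here are illustrative, not Woven's internal tooling.

INITIAL_CERT_MAX_ERROR = 0.06  # initial certification: 6% error rate or lower
ONGOING_MAX_ERROR = 0.05       # months 2 and beyond: 5% or lower

def error_threshold(months_on_team: int) -> float:
    """Maximum allowed error rate for a scorer at a given tenure."""
    return INITIAL_CERT_MAX_ERROR if months_on_team <= 1 else ONGOING_MAX_ERROR

def monthly_review(error_rate: float, months_on_team: int, coached_last_month: bool) -> str:
    """Coach scorers above the threshold; if the rate hasn't improved after
    coaching, the scorer is let go."""
    if error_rate <= error_threshold(months_on_team):
        return "in good standing"
    return "terminated" if coached_last_month else "coaching"
```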
Do you have any data on the reliability of the graders, such as how similar their scores are?
Our scorers are kept double-blind (to each other and to the candidate's background and information) and grade only specific true/false rubric items within a scenario. Their scores should be the same; however, there are times when they differ, which requires a reconciliation scorer (a 3rd scorer) whose score is the deciding factor.
We track how often a reconciliation score is needed for each evaluator. When scorer 1 differs from scorer 2 and the 3rd (reconciliation) score agrees with scorer 2, that is marked as an error for scorer 1. That error rate is reported monthly to management and then shared with the scorer.
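A minimal sketch of that reconciliation and error-attribution logic, using hypothetical names rather than our production system:

```python
# Illustrative sketch of reconciliation on a single true/false rubric item;
# names are hypothetical, not Woven's production scoring system.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RubricItemResult:
    final_score: bool
    needed_reconciliation: bool
    scorer_in_error: Optional[str]  # "scorer_1", "scorer_2", or None

def resolve_item(scorer_1: bool, scorer_2: bool, reconciler: bool) -> RubricItemResult:
    """Two blind scorers grade the item; a 3rd scorer only breaks disagreements."""
    if scorer_1 == scorer_2:
        return RubricItemResult(scorer_1, needed_reconciliation=False, scorer_in_error=None)
    # The scorers disagree: the reconciliation score decides, and the outvoted
    # scorer is charged with an error for the monthly report.
    in_error = "scorer_1" if reconciler == scorer_2 else "scorer_2"
    return RubricItemResult(reconciler, needed_reconciliation=True, scorer_in_error=in_error)
```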
Our initial certification bar is a 6% error rate or lower; in subsequent months, the maximum allowed error rate is 5%. The average single-scorer error rate is 3.5%. Because of double-scoring (2 independent scorers who are blind to each other), the double error rate, where both scorers make the same error, is approximately 1% (higher than full independence would predict, because errors aren't perfectly uncorrelated).
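To make the arithmetic concrete: if the two scorers' errors were fully independent, both missing the same item would happen about 0.035 × 0.035 ≈ 0.12% of the time; the observed ~1% reflects that scorer errors are partially correlated. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check on the double-scoring error rate, restating the
# figures above; the "observed" value is the approximate 1% quoted above.

single_scorer_error = 0.035        # average single-scorer error rate (3.5%)

# If the two blind scorers' errors were fully independent, a double error
# would require both to miss the same rubric item:
independent_double_error = single_scorer_error ** 2   # ~0.0012, i.e. ~0.12%

observed_double_error = 0.01       # ~1%; higher because errors between
                                   # scorers are partially correlated

print(f"fully independent errors: {independent_double_error:.2%}")
print(f"observed (approximate):   {observed_double_error:.2%}")
```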