Asynchronous Proctoring for ChatGPT Detection

Updated by Wes Winham

Introducing Asynchronous Proctoring: A Human-Centered Approach to Detecting AI Assistance

With tools like ChatGPT so easily accessible, the trustworthiness of technical assessments has quickly become complicated. We're excited to introduce our latest advancement: Asynchronous Proctoring, which efficiently identifies external assistance, including the use of AI assistants like ChatGPT.

Algorithmic Detection Doesn’t Work

You may have seen ChatGPT detectors that attempt to discern patterns in code and language. We explored these technologies and found each of them riddled with both false negatives and false positives. Even OpenAI, with over $1B in funding, couldn't make algorithmic detection work, and it retired its own AI text classifier because of its low accuracy.

Analyzing only the final output of an assessment is hard not just for so-called AI detection software; it's hard for humans too. We found that the reliable way to detect use of external assistance is to look for a combination of cues:

  • Time spent on the assessment
  • Behavior in the IDE
  • Length of submitted content
  • Patterns in the content, observed over thousands of candidates
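
To make these cues concrete, here is a minimal, hypothetical sketch of how a candidate's session events might be summarized into signals for a human reviewer. This is not Woven's actual tooling; every field name and threshold below is an illustrative assumption.

from dataclasses import dataclass
from typing import List, Optional

# Hypothetical event record; the fields and the 500-character "large paste"
# threshold are illustrative assumptions, not Woven's actual data model.
@dataclass
class IdeEvent:
    seconds_into_session: int  # when the event happened
    kind: str                  # e.g. "keystroke", "paste", "run"
    chars: int = 0             # size of inserted text, if any

@dataclass
class ReviewCues:
    total_minutes: float
    largest_paste_chars: int
    earliest_large_paste_minute: Optional[float]
    submission_length: int

LARGE_PASTE_CHARS = 500  # assumed threshold for a "large" paste

def summarize_session(events: List[IdeEvent], submission: str,
                      total_seconds: int) -> ReviewCues:
    """Condense a session into the cues a reviewer looks at.

    Nothing here makes a decision; it only surfaces time spent, paste
    behavior, and submission length for human judgment.
    """
    large_pastes = [e for e in events
                    if e.kind == "paste" and e.chars >= LARGE_PASTE_CHARS]
    return ReviewCues(
        total_minutes=total_seconds / 60,
        largest_paste_chars=max((e.chars for e in large_pastes), default=0),
        earliest_large_paste_minute=(
            min(e.seconds_into_session for e in large_pastes) / 60
            if large_pastes else None),
        submission_length=len(submission),
    )

# Example: a 2,400-character paste three and a half minutes in stands out
# immediately when a reviewer scans these cues.
events = [
    IdeEvent(seconds_into_session=30, kind="keystroke", chars=1),
    IdeEvent(seconds_into_session=210, kind="paste", chars=2400),
]
print(summarize_session(events, submission="...", total_seconds=600))

In Woven's process, signals like these are inputs to a human reviewer's judgment, not an automated verdict.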

Introducing Asynchronous Proctoring

To overcome the limitations of algorithmic detection and output inspection, we've introduced a revolutionary human-centered approach. 

Our technique involves skilled human experts reviewing every single assessment for signs of external assistance. They watch a full playback of the candidate's session, evaluating their behavior and final responses against guidelines our team has defined and tested.

Examples

One candidate pasted a large amount of code into the IDE just three and a half minutes into the challenge. That would be cause for us to be suspicious of their behavior, investigate further, and likely conclude that they used external assistance.

Another candidate pasted part of a conversation with ChatGPT into the IDE at one point during their attempt. Although they deleted it before submitting, it was a dead giveaway that they had used external assistance. We flagged the candidate's response and let the customer know. If we were only inspecting the final output, this is evidence we would have missed.

How you’ll be notified

We now check for external assistance first, and if we find conclusive evidence, we will not continue scoring the candidate. We'll deliver the recommendation with a notice that we found evidence of external assistance, along with the evidence itself.

We also let the candidate know we found a code of conduct violation. If you or the candidate replies with concerns, we’ll gladly take another look or score the candidate as normal. 

In the ever-evolving AI landscape, we're leveraging our team of certified engineers to do what algorithmic detection can't: Asynchronous Proctoring that identifies the use of AI assistants. We'd love to hear your thoughts on how we can make this protocol even stronger, so you can continue to trust Woven's analysis in helping you make great hires.

FAQ

Q: Which scenarios get Async Proctoring? 

For now, we're evaluating all programming scenarios, as well as our Mobile Architecture and Product Usage Drop-Off scenarios. Next, we'll be adding Asynchronous Proctoring to more of our free-text scenarios, such as Architecture Debugging and Debugging Social Media.

Q: How do candidates know they can't use ChatGPT or other AI assistance?

We repeat the message in several places. It's prominent in the prep guide that candidates can review before the timer starts, and they must also click through an acknowledgement modal before every scenario.

Example text:

AI assistants and outside resources:

  • For this scenario, using ChatGPT or other AI assistants will disqualify you as a candidate. This ensures a fair and authentic assessment of your skills and abilities.
  • You may use outside resources such as Stack Overflow to look up libraries, syntax, and other details as long as you cite your sources in the notes.

Q: Can candidates appeal?

Yes! We give candidates the option to provide new evidence and appeal. That triggers a manual review from our team.

The candidate gets the following message in their feedback email:

"We identified a violation of our Code of Conduct. If you believe this is a mistake, please reply to this email, especially with any evidence. Our Trust and Safety Team will be happy to take another look.”

There's no time limit on when a candidate can submit an appeal, but once an appeal is opened, Woven will complete the appeals process within 72 hours.

Q: What does the appeals process look like?

  1. Customer Notification of Appeal: When a candidate appeals, we'll update the candidate's Slack message to make it clear to you that we're reviewing the appeal. This works similarly to the messages you see when we're handling a candidate's assessment issue. For transparency, we'll also communicate final appeal results, even for appeals that we ultimately deny (historically, about 90% of appeals).
  2. Appeal Decision Service Level Agreement (SLA): We're instituting a 72-hour SLA for appeal decisions, starting from the time the candidate appeals. We've also analyzed the workflow and handoffs involved to catch the gaps that allowed candidates to fall through the cracks. Tactics include visualizing the work, a report that shows appeal aging, and clear ownership of the initial appeal review.
  3. Communicating an "Inconclusive" Code of Conduct status: With several months of appeals data now in hand, we've identified that the main driver of overturned appeals is the case best summarized as "probably used ChatGPT, but maybe they're just a very fast outlier using their own editor." Today, about 6% of all candidates fall into this bucket.

Q: How long should I wait to reject a candidate for cheating if they are flagged for a Code of Conduct violation?

We recommend that you continue your process as normal and without delay. If the candidate's appeal succeeds and we overturn our original decision, we'll let both you and the candidate know, and at that point you can decide whether to move them back into your process. So far, 10 candidates have been scored as a 0 and 2 have appealed; only 1 of those appeals was legitimate. Across our entire customer base, about 5% of 0-scored candidates appeal, and in roughly 1 in 10 of those cases we find we made a mistake and overturn our original decision.

Q: The appeal resulted in a "Maybe Cheated." What does this mean?

We have suspicion, but not conclusive proof, that this candidate used ChatGPT or other external assistance to respond to one or more scenarios. We proceeded with scoring because we can't be certain. We do strongly recommend asking follow-up questions about their response if you choose to advance the candidate.

The note you will see:

This candidate appealed the Code of Conduct violation for the following scenario(s): scenario name

Upon further review, our Trust and Safety team has overturned the initial Code of Conduct Violation ruling.

Our new ruling is: Possible, but inconclusive, violation of Code of Conduct

It is possible that this candidate violated the code of conduct by using ChatGPT or outside help, but they claim to have been using their own IDE throughout the scenario. We viewed the detailed playback again to confirm. We didn't find the evidence of external assistance sufficiently conclusive. We updated their evaluation, and we want you to be aware of this. Going forward, we are communicating these "maybe cheated" cases with a highly visible note for candidates in this category so that subsequent interviewers are aware.

Q: The only reason I can see for a large paste like that would be the candidate using an alternative editor, whether out of frustration with your editor or simply comfort with their own. What is the level of risk here?

In a case like this, the candidate was likely using their own editor, possibly alongside ChatGPT or other AI help. They did make reasonable edits within the IDE, and the code they pasted in didn't fully work, so this one is a bit of a judgment call.


