> We are, as of right now, operating under the evidence-backed assumption that humans cannot identify the difference between GPT and non-GPT posts with sufficient operational accuracy to apply these methods to the platform as a whole.
What "evidence" do you have for that? Have you actually tested this with one of the Stack Overflow Mods who have reviewed thousands of GPT posts?
What if you:
- Take two of the Stack Overflow moderators who have handled thousands of AI flags.
- Take 100 random, pre-2020 questions from Stack Overflow that have existing answers.
- Paste the markdown of each question into ChatGPT.
- Take the unedited ChatGPT results and ask the Mods to identify the ChatGPT answer (or say whether they believe it is inconclusive).
- Then tell us the error rate: how many GPT answers were identified correctly and how many were "wrongly suspended" ;-)
Have you done anything close to this to support your "evidence-backed assumption"?
That's certainly the "easy" version of the test, since most GPT answers do (IMHO) have edits at this point. But if you don't believe it's possible for a human to identify a GPT answer at all, let's start with this easy test.
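For concreteness, here's a minimal sketch (in Python, all names and data structures hypothetical) of how that easy test could be scored, assuming each moderator verdict is the answer id they picked as GPT, or `None` for "inconclusive":

```python
# Hypothetical scoring for the "easy" test. Every question in the pool has
# exactly one unedited ChatGPT answer mixed in with its existing human answers.

def score_easy_test(verdicts, gpt_answer_ids):
    """verdicts: {question_id: picked_answer_id or None}
    gpt_answer_ids: {question_id: id of the planted GPT answer}"""
    correct = wrongly_suspended = inconclusive = 0
    for qid, picked in verdicts.items():
        if picked is None:
            inconclusive += 1
        elif picked == gpt_answer_ids[qid]:
            correct += 1
        else:
            # A human-written answer was fingered as GPT: a "wrong suspension".
            wrongly_suspended += 1
    total = len(verdicts)
    print(f"identified correctly: {correct}/{total}")
    print(f"wrongly suspended:    {wrongly_suspended}/{total}")
    print(f"inconclusive:         {inconclusive}/{total}")

# Example: one correct pick, one wrong suspension, one inconclusive.
score_easy_test({101: 7, 102: 9, 103: None}, {101: 7, 102: 8, 103: 5})
```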
If you want to go a little more "difficult":
- Take two of the Stack Overflow moderators who have handled thousands of AI flags.
- Take 100 random, pre-2020 questions from Stack Overflow that have existing answers.
- Take a random 50 of those.
- Paste the markdown of each of those questions into ChatGPT.
- Add the unedited ChatGPT results to the pool of answers for those 50 questions.
- Place all 100 original questions before the Mods and ask them to say, for each, whether there is a GPT answer (and which one), there is no GPT answer, or it's inconclusive.
- Then tell us how many GPT answers were identified correctly and how many were "wrongly suspended" ;-)
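Scoring this variant needs one extra distinction, since only half the questions actually contain a planted GPT answer, so "no GPT answer" can now be right or wrong. A sketch under the same hypothetical assumptions:

```python
from collections import Counter

def score_hard_test(verdicts, planted):
    """verdicts: {question_id: picked_answer_id, "none", or "inconclusive"}
    planted: {question_id: GPT answer id, or None if no GPT answer was added}"""
    tally = Counter()
    for qid, verdict in verdicts.items():
        truth = planted[qid]
        if verdict == "inconclusive":
            tally["inconclusive"] += 1
        elif truth is None:
            # Nothing was planted: picking any answer wrongly suspends a human.
            key = "correctly said none" if verdict == "none" else "wrongly suspended"
            tally[key] += 1
        elif verdict == "none":
            tally["missed the GPT answer"] += 1
        elif verdict == truth:
            tally["identified correctly"] += 1
        else:
            tally["wrongly suspended"] += 1
    for outcome, count in tally.items():
        print(f"{outcome}: {count}")
```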
I'll put my money where my mouth is and happily take the challenge as well. That way you can even determine the rate at which a non-Moderator potentially flags false positives.
I'd be interested in the results if it were run twice: once with the subjects using automated detectors to verify their suspicions, and once with no automated detectors allowed.
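If both runs were done, one rough way to check whether detector assistance actually changed accuracy is a two-proportion z-test on the correct-identification counts; a sketch, with entirely made-up numbers:

```python
import math

def compare_runs(correct_with, n_with, correct_without, n_without):
    """Compare accuracy with automated detectors allowed vs. without."""
    p1, p2 = correct_with / n_with, correct_without / n_without
    pooled = (correct_with + correct_without) / (n_with + n_without)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_with + 1 / n_without))
    z = (p1 - p2) / se
    print(f"with detectors: {p1:.0%}, without: {p2:.0%}, z = {z:.2f}")

compare_runs(41, 50, 36, 50)  # |z| > 1.96 would suggest a real difference
```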