> We are, as of right now, operating under the evidence-backed assumption that humans cannot identify the difference between GPT and non-GPT posts with sufficient operational accuracy to apply these methods to the platform as a whole.
What "evidence" do you have for that? Have you actually tested this with one of the Stack Overflow Mods who have reviewed thousands of GPT posts?
What if you:
- Take two of the Stack Overflow moderators who have handled thousands of AI flags.
- Take 100 random, pre-2020 questions from Stack Overflow that have existing answers.
- Paste the markdown of each question into ChatGPT.
- Take the unedited ChatGPT results and ask the Mods to identify the ChatGPT answer (or say whether they believe it is inconclusive).
- Then tell us the error rate: how many GPT answers were identified correctly and how many were "wrongly suspended" ;-)
Have you done anything close to this to support your "evidence-backed assumption"?
That's certainly the "easy" version of the test, since most GPT answers do (IMHO) have edits at this point. But if you don't believe it's possible for a human to identify a GPT answer at all, let's start with this easy test.
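For concreteness, here's a minimal sketch (in Python, all names and data structures hypothetical) of how that easy test could be scored, assuming each moderator verdict is the answer id they picked as GPT, or `None` for "inconclusive":

```python
# Hypothetical scoring for the "easy" test. Every question in the pool has
# exactly one unedited ChatGPT answer mixed in with its existing human answers.

def score_easy_test(verdicts, gpt_answer_ids):
    """verdicts: {question_id: picked_answer_id or None}
    gpt_answer_ids: {question_id: id of the planted GPT answer}"""
    correct = wrongly_suspended = inconclusive = 0
    for qid, picked in verdicts.items():
        if picked is None:
            inconclusive += 1
        elif picked == gpt_answer_ids[qid]:
            correct += 1
        else:
            # A human-written answer was fingered as GPT: a "wrong suspension".
            wrongly_suspended += 1
    total = len(verdicts)
    print(f"identified correctly: {correct}/{total}")
    print(f"wrongly suspended:    {wrongly_suspended}/{total}")
    print(f"inconclusive:         {inconclusive}/{total}")

# Example: one correct pick, one wrong suspension, one inconclusive.
score_easy_test({101: 7, 102: 9, 103: None}, {101: 7, 102: 8, 103: 5})
```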
If you want to go a little more "difficult":
- Take two of the Stack Overflow moderators who have handled thousands of AI flags.
- Take 100 random, pre-2020 questions from Stack Overflow that have existing answers.
- Take a random 50 of those.
- Paste the markdown of each of those questions into ChatGPT.
- Add the unedited ChatGPT results to the pool of answers for those 50 questions.
- Place all 100 original questions before the Mods and ask them to say, for each, whether there is a GPT answer (and which one), there is no GPT answer, or it's inconclusive.
- Then tell us how many GPT answers were identified correctly and how many were "wrongly suspended" ;-)
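Scoring this variant needs one extra distinction, since only half the questions actually contain a planted GPT answer, so "no GPT answer" can now be right or wrong. A sketch under the same hypothetical assumptions:

```python
from collections import Counter

def score_hard_test(verdicts, planted):
    """verdicts: {question_id: picked_answer_id, "none", or "inconclusive"}
    planted: {question_id: GPT answer id, or None if no GPT answer was added}"""
    tally = Counter()
    for qid, verdict in verdicts.items():
        truth = planted[qid]
        if verdict == "inconclusive":
            tally["inconclusive"] += 1
        elif truth is None:
            # Nothing was planted: picking any answer wrongly suspends a human.
            key = "correctly said none" if verdict == "none" else "wrongly suspended"
            tally[key] += 1
        elif verdict == "none":
            tally["missed the GPT answer"] += 1
        elif verdict == truth:
            tally["identified correctly"] += 1
        else:
            tally["wrongly suspended"] += 1
    for outcome, count in tally.items():
        print(f"{outcome}: {count}")
```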
I'll put my money where my mouth is and happily take the challenge as well. That way you can even determine the rate at which a non-Moderator potentially flags false positives.
I'd be interested in the results if it were run twice: once with the subjects using automated detectors to verify their suspicions, and once with no automated detectors allowed.
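If both runs were done, one rough way to check whether detector assistance actually changed accuracy is a two-proportion z-test on the correct-identification counts; a sketch, with entirely made-up numbers:

```python
import math

def compare_runs(correct_with, n_with, correct_without, n_without):
    """Compare accuracy with automated detectors allowed vs. without."""
    p1, p2 = correct_with / n_with, correct_without / n_without
    pooled = (correct_with + correct_without) / (n_with + n_without)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_with + 1 / n_without))
    z = (p1 - p2) / se
    print(f"with detectors: {p1:.0%}, without: {p2:.0%}, z = {z:.2f}")

compare_runs(41, 50, 36, 50)  # |z| > 1.96 would suggest a real difference
```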