We are, as of right now, operating under the evidence-backed assumption that humans cannot identify the difference between GPT and non-GPT posts with sufficient operational accuracy to apply these methods to the platform as a whole.

What "evidence" do you have for that? Have you actually tested this with one of the Stack Overflow Mods who have reviewed thousands of GPT posts?

What if you:

  • Take two of the Stack Overflow moderators who have handled thousands of AI flags.
  • Take 100 random, pre-2020 questions from Stack Overflow that have existing answers
  • Paste the markdown of the question into ChatGPT
  • Take the unedited ChatGPT results and ask the Mods to identify the ChatGPT answer (or whether they believe it is inconclusive).
  • Then tell us how many GPT answers were identified correctly and how many were "wrongly suspended" ;-) (a rough scoring sketch follows this list)
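
To make the scoring concrete, here's a minimal sketch of how the "easy" run could be tallied. The judgment labels ("gpt", "human", "inconclusive") and the function name are my own assumptions for illustration, not anything already specified:

```python
# Scoring sketch for the "easy" test: every answer shown to the mods is a raw,
# unedited ChatGPT answer, and each one gets a single judgment label.
# (The label strings and record layout below are assumptions for this sketch.)
from collections import Counter

def score_easy_test(judgments):
    """judgments: one label per ChatGPT answer: "gpt", "human", or "inconclusive"."""
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "correctly identified": counts["gpt"],
        "missed (called human)": counts["human"],
        "inconclusive": counts["inconclusive"],
        "detection rate": counts["gpt"] / total if total else 0.0,
    }

# Made-up example run of 100 judgments, purely to show the output shape:
example = ["gpt"] * 92 + ["inconclusive"] * 6 + ["human"] * 2
print(score_easy_test(example))
```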

Have you done anything close to this to support your "evidence-backed assumption"?

That's certainly the "easy" version of the test, since most GPT answers do (IMHO) have edits at this point. But if you don't believe it's possible for a human to identify a GPT answer at all, let's start with this easy test.

If you want to go a little more "difficult":

  • Take two of the Stack Overflow moderators who have handled thousands of AI flags.
  • Take 100 random, pre-2020 questions from Stack Overflow that have existing answers
  • Take a random 50 of those
  • Paste the markdown of the question into ChatGPT
  • Add the unedited ChatGPT results to the pool of answers for those 50 questions
  • Place all 100 original questions before the Mods and ask them to identify whether there's a GPT answer (and which one), no GPT answer, or inconclusive.
  • Then tell us how many GPT answers were identified correctly and how many were "wrongly suspended" ;-) (a sketch of the blinded scoring follows this list)
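
The blinded variant needs to count both kinds of mistakes: planted ChatGPT answers that slip through, and genuine human answers that get accused. Here's a rough sketch of that scoring under an assumed verdict format (one verdict per question: an answer id, "none", or "inconclusive"); none of these field names come from the actual proposal:

```python
# Scoring sketch for the blinded test: 50 of the 100 questions have one unedited
# ChatGPT answer planted among the real answers, the other 50 are untouched.
# (Field names and verdict strings are assumptions for this sketch.)

def score_blinded_test(questions):
    """questions: dicts like {"planted_answer_id": "a17" or None,
    "mod_verdict": an answer id, "none", or "inconclusive"}."""
    hits = misses = wrongly_suspended = correct_passes = inconclusive = 0
    for q in questions:
        planted, verdict = q["planted_answer_id"], q["mod_verdict"]
        if verdict == "inconclusive":
            inconclusive += 1
        elif verdict == "none":
            if planted is None:
                correct_passes += 1    # clean question correctly left alone
            else:
                misses += 1            # planted ChatGPT answer slipped through
        elif verdict == planted:
            hits += 1                  # ChatGPT answer correctly identified
        else:
            wrongly_suspended += 1     # a human author would have been "suspended"
    return {"hits": hits, "misses": misses, "wrongly suspended": wrongly_suspended,
            "correct passes": correct_passes, "inconclusive": inconclusive}

# Tiny made-up example: one planted answer caught, one clean question wrongly accused.
print(score_blinded_test([
    {"planted_answer_id": "a17", "mod_verdict": "a17"},
    {"planted_answer_id": None, "mod_verdict": "a42"},
]))
```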

I'll put my money where my mouth is and happily take the challenge as well. That way you can even measure the rate at which a non-Moderator raises false positives.

I'd be interested in the results if it were run twice: once with the subjects allowed to use automated detectors to verify their suspicions, and once with no automated detectors allowed.
