Google uses a ChatGPT-like system to detect spam and AI content and rank websites.
While that headline might seem misleading, it is only so in its use of the term ChatGPT. Calling the technology “ChatGPT-like” lets you, the reader, know immediately what type of technology I am referring to. (Also, a more precise headline wouldn’t be as clickable …)
This article will focus on an older but still relevant Google paper, “Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study.”
What’s the paper all about?
Let’s begin with how the authors introduce the topic:
“Many have expressed concern about the dangers of neural text generators in the wild, due to their ability to produce human-looking text at scale.
Classifiers that distinguish between machine-generated and human text have recently been used to monitor machine-generated content on the web. Little work, however, has been done to apply these classifiers for other purposes, despite their appealing property of not requiring labels – they need only a corpus and a generative model. We show that off-the-shelf human-vs-machine discriminators can serve as classifiers of page quality: texts that appear machine-generated tend to be incoherent or unintelligible. To understand the prevalence of low page quality in the wild, we apply the classifiers to a sample of half a million English webpages.”
In essence, they are saying that the same classifiers used to detect AI-generated copy can also be used to detect low-quality content.
This leaves us with a crucial question:
Is this causation (i.e., the system picks up low quality because it is genuinely good at detecting it) or correlation (i.e., a lot of today’s spam happens to be created in ways that are easy to detect, which better tools could evade)?
Before we get into that, let’s look at what the authors did and what they concluded.
The set-up
For their study, they used the following:
- Two detectors: OpenAI’s RoBERTa-based GPT-2 detector (which uses the RoBERTa model to predict whether text is GPT-2 output, i.e., AI-generated) and the GLTR model. Both have access to the top GPT-2 outputs and operate similarly. The paper includes an example showing the output of the model.
- Three datasets: HTML500M (a random selection of 500 million English web documents), GPT-2 Output (250k texts generated by GPT-2) and Grover-Output (over 1.2 million articles they generated internally with the pre-trained Grover-Base model, which was designed to detect fake news).
- A spam baseline: a classifier trained on the Enron Spam Email dataset. This classifier was used to assign the Language Quality (LQ) score: for example, if the model determined that a document was not spam with a probability of 0.2, the document was assigned an LQ score of 0.2.
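In code, the scoring rule described above can be sketched like this (a toy illustration of my own, not the paper’s actual classifier; the function name and example value are mine):

```python
def language_quality(p_not_spam: float) -> float:
    """Toy sketch of the paper's Language Quality (LQ) score:
    LQ is simply the spam classifier's estimated probability
    that the document is NOT spam, so it ranges 0.0 to 1.0."""
    if not 0.0 <= p_not_spam <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return p_not_spam

# A document the classifier judges 20% likely to be non-spam
# receives an LQ score of 0.2.
print(language_quality(0.2))  # 0.2
```

The point is that LQ is not a separately trained quality model – it is a direct reuse of the spam classifier’s probability output.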
A side note about spam prevalence
I want to briefly mention one of the interesting discoveries the authors made along the way, illustrated below (Figure 3 of the paper):
Pay attention to the score below each graph: a score closer to 1.0 indicates a higher likelihood that the content is spam. As we can see, low-quality documents were already a problem in 2017, with a spike in 2019.
They also found that low-quality content hit some sectors harder than others. (Again, a higher score indicates a higher likelihood of spam.)
A few of these confused me at first; others made perfect sense. Books and literature came as a surprise. So did health – until the authors explained that sites selling Viagra and other adult health products counted as “health,” and essay farms as “literature.”
The findings
Beyond the discussion of sectors and the 2019 spike, the authors surfaced several findings that SEOs can learn from – especially as we begin to use tools like ChatGPT:
- Low-quality content tends to be shorter (peaking at around 3,000 characters).
- Detection systems trained to determine whether text was machine-written can also classify low- versus high-quality content.
- They call out content created specifically for search rankings as a prime suspect – the kind of garbage we all know shouldn’t be there.
The authors don’t claim this is the be-all and end-all solution; they simply suggest it as a starting point, and I’m certain Google has moved the bar up in the years since.
Note about AI-generated content
Language models have also evolved over the years. GPT-3 was not yet available when this paper was published, and the detectors the researchers used were based on GPT-2, a far less capable model.
GPT-4 is expected to be available soon, and Google’s Sparrow is due later in the year. Not only is the technology improving on both sides (search engines vs. content generators), but combinations of models are also making the battleground more competitive.
Can Google detect Sparrow and GPT-4 content? Maybe.
But what if Sparrow generates the content and GPT-4 is then prompted to rewrite it?
Remember that the detectors in this paper are based on autoregressive models – models that score each word based on how expected that word is given the words that precede it.
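As a rough sketch (a toy model of my own, not the paper’s detectors), here is how an autoregressive model scores text token by token; detectors like GLTR examine whether these per-token probabilities look suspiciously predictable across a whole document:

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Sum of log P(token_i | tokens[:i]) under a conditional model.
    `cond_prob(context, token)` returns the model's probability of
    `token` given the preceding `context` tokens."""
    return sum(math.log(cond_prob(tokens[:i], tok))
               for i, tok in enumerate(tokens))

# Toy conditional model: every token is equally likely over a
# four-word vocabulary, regardless of context.
uniform = lambda context, token: 0.25

# Three tokens at probability 0.25 each: 3 * log(0.25)
score = sequence_log_prob(["the", "cat", "sat"], uniform)
```

A real detector would use GPT-2’s learned conditional distribution instead of `uniform`, but the scoring loop has the same shape: each word is judged only against the words before it.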
Detection may become harder as models grow more sophisticated and generate fully formed ideas rather than just predicting the next word.
On the other hand, AI may also be our best option for detecting poor content.
The post Google uses a ChatGPT-like system to detect spam and AI content and rank websites appeared first on Search Engine Land.