LLMs are a recipe for SEO disaster
“ChatGPT is able to pass the bar.”
“GPT receives an A+ in all exams.”
“GPT passes MIT entrance test with flying colors.”
How many people have read an article saying something like the above? I know I have seen plenty. Every day there seems to be a new thread claiming GPT is close to Skynet, artificial general intelligence, or just plain better than humans.
Recently, I was asked: “Why doesn’t ChatGPT respect my word count input? It’s a machine, right? A reasoning engine? It should be able to count the words in a sentence.”
This is a common misunderstanding about large language models.
ChatGPT’s form is deceptive in some ways.
Its presentation and interface are those of a conversational partner: part AI companion, part calculator, part search engine, the chatbot to end all chatbots.
But it is not. In this article, I will go over some case studies, some experimental and some from the real world.
We will look at how these tools were presented, the problems that arise, and whether anything can be done about their weaknesses.
Case 1: GPT vs. MIT
A team of undergraduate researchers recently wrote about GPT acing the MIT EECS curriculum, and the claim went somewhat viral on Twitter, picking up 500 retweets.
The paper is a mess, and I would like to focus on two of the most important issues with it: plagiarism and hype-based marketing.
GPT was able to answer some questions correctly because it had already seen them. The response article covers this under “Information leakage in few-shot examples”: as part of their prompt engineering, the study team included few-shot examples that effectively revealed the answers to ChatGPT.
The 100% claim is also a problem because some of the questions in the set were unanswerable: either the bot was not given the information needed to answer them, or the question relied on another question that the bot did not have access to.
Another issue is the prompting itself. The paper’s automation included this specific step:
"Review your previous answers and identify any problems. Improve your answer based on the problems that you identified. Please provide feedback about the incorrect answer. Please answer the question again after receiving feedback. "]]
prompt_response = prompt(expert) # calls fresh ChatCompletion.create prompt_grade = grade(course_name, question, solution, prompt_response) # GPT-4 auto-grading comparing answer to solution
The paper commits to a problematic grading system here: GPT’s responses to these prompts do not necessarily produce factual, objective grades.
To recreate a tweet from Ryan Jones:
With prompting like this, some of these questions would almost always lead to a correct solution eventually.
Because it is generative, GPT may also be unable to accurately compare its own answer to the correct one. Even when its answer has had to be corrected, it will say there were no errors.
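To make that concrete, here is a minimal, hypothetical sketch of this kind of cascade (the helper functions and grading prompt are mine, not the paper’s), using the same pre-1.0 openai SDK the quoted code references: keep re-asking until the model, acting as its own grader, declares the answer correct.

import openai  # pre-1.0 SDK, matching the ChatCompletion.create call quoted above

def ask(messages):
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return response["choices"][0]["message"]["content"]

def cascade(question, solution, max_rounds=3):
    messages = [{"role": "user", "content": question}]
    answer = ask(messages)
    for _ in range(max_rounds):
        # GPT-4 grades its own output against the official solution.
        verdict = ask([{
            "role": "user",
            "content": (
                f"Question: {question}\nSolution: {solution}\nAnswer: {answer}\n"
                "Is the answer correct? Reply yes or no."
            ),
        }])
        if verdict.strip().lower().startswith("yes"):
            return answer, True  # "graded" correct by the same kind of model that wrote it
        # Otherwise, push the critique prompt back in and try again.
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "Review your previous answer, identify any problems, and answer again."},
        ]
        answer = ask(messages)
    return answer, False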
Natural language processing is primarily either abstractive or extractive. Generative AI tries to get the best of both worlds and ends up with neither.
Gary Illyes recently had to take to social media to reinforce this point:
This is a great opportunity to discuss hallucinations, prompt engineering and other topics.
Hallucination is the term used to describe situations in which machine learning models (specifically generative AI) produce unexpected and incorrect results.
Over time, I’ve become frustrated by the term used to describe this phenomenon:
- It implies “thought” and “intention,” which these algorithms lack.
- GPT does not know the difference between a hallucination and the truth. Expecting hallucinations to decrease in frequency is wildly optimistic, because that would mean an LLM actually had a better grasp of what is true.
GPT hallucinates because it follows patterns in text and applies them, over and over, to other patterns of text; when those applications are wrong, nothing about the process is any different from when they are right.
I’m now going to talk about prompt engineering.
Using GPT and tools like it is the new hotness, and so is prompt engineering: “I’ve created a prompt that gets me exactly what I want. Buy my ebook to learn more!”
Prompt engineering is a relatively new, well-paying job. But how much can prompting actually improve GPT?
It is possible to over-engineer prompts.
GPT becomes less accurate as it is forced to deal with more variables. The more complex and longer your prompt is, the less effective the safeguards are.
When I ask GPT to audit my site, I get the standard “As an AI language model…” response. The more complex my request, the less likely I am to receive accurate information.
Xenia Volynchuk is real, but not the website. Yulia Sapegina does not appear to exist and Zeck Ford doesn’t seem to be an SEO website at all.
Your responses will be generic if you underengineer. Your responses will be wrong if you overengineer.
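A hypothetical before-and-after shows the point (the prompts and URL here are my own, purely for illustration): the first prompt is bland but safe, while the second asks for things the model cannot actually do, such as crawling a site or counting characters, so it invites confident but wrong output.

import openai  # pre-1.0 openai SDK

simple_prompt = "Suggest five title tags for a page about trail running shoes."

overengineered_prompt = (
    "You are a world-class SEO, data scientist and CRO expert. "
    "Crawl https://example.com, audit its Core Web Vitals and internal links, "
    "then write five title tags of exactly 57 characters each and score each one out of 100."
)

for prompt in (simple_prompt, overengineered_prompt):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # The first response tends to be generic; the second tends to be confidently made up.
    print(response["choices"][0]["message"]["content"])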
Case 2: Math vs. Math
A question similar to this goes viral every few months on social media.
How do you add 23 and 48 in your head?
Some people add 3 and 8 to get 11, add 20 and 40 to get 60, then combine them for 71. Others take 2 from the 23 to turn 48 into 50 and add the remaining 21. Different people’s brains calculate things differently.
Now think back to fourth grade math. Do you remember multiplication tables? How did you work with them?
There are worksheets that show how multiplication works, but for many students the goal was simply to memorize the answers.
When I hear 6 x 7, I don’t do the math in my head. My dad drilled the multiplication tables into me over and over, so I say 42 not because I calculate it, but because I have it memorized.
I bring this up because it is closer to how LLMs approach math. An LLM looks for patterns across vast amounts of text. It does not know what “2” means, only the contexts in which “2” tends to appear.
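As a small illustration, OpenAI’s open-source tokenizer, tiktoken, shows what the model actually works with: integer token IDs rather than quantities (the exact IDs depend on the encoding).

import tiktoken  # OpenAI's open-source tokenizer library

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4-era models
print(enc.encode("23 x 48"))  # a handful of arbitrary integer token IDs
print(enc.encode("1104"))     # the "answer" is just more arbitrary IDs
# The model never handles these as quantities; it only learns which IDs tend to follow which.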
OpenAI is particularly interested in cracking logical reasoning. GPT-4 is its latest model and, according to the company, it has better logical reasoning than its predecessors. I am not an OpenAI engineer, but I would like to walk through some of what they have done to make GPT-4 a “better reasoning” model.
OpenAI aims to engineer away the shortcomings of LLMs in much the same way Google strives for algorithmic perfection and tries to engineer human factors, like links, out of its rankings.
OpenAI improves ChatGPT’s “reasoning capabilities” in two broad ways:
- Using GPT itself, with models stacked on top of one another.
- Using external, non-LLM tools (such as plugins and other machine learning systems).
For the first group, OpenAI refines models by stacking them on top of one another. This is effectively the difference between ChatGPT and plain GPT.
Plain GPT is an engine that outputs the next likely tokens after a piece of text. ChatGPT, on the other hand, is a model trained to follow commands and steps.
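As a rough illustration using the 2023-era openai Python SDK (the same ChatCompletion.create interface quoted earlier), a base completion model simply continues text, while the chat model is tuned to treat the same words as an instruction. The prompt below is my own example.

import openai  # pre-1.0 openai SDK, matching the ChatCompletion.create calls quoted above

# Plain GPT: a base completion model just continues the text with likely next tokens.
base = openai.Completion.create(
    model="davinci",  # a base GPT-3 model
    prompt="Write a meta description for a page about trail running shoes",
    max_tokens=60,
)

# ChatGPT-style model: trained to treat the same words as a command to carry out.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a meta description for a page about trail running shoes"}],
)

print(base["choices"][0]["text"])                # often rambles or keeps "writing the page"
print(chat["choices"][0]["message"]["content"])  # usually complies with the instruction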
The way these layers interact, and the scale at which the model can recognize patterns across different contexts, is what lets GPT be more than a “fancy autocorrect.”
The model can make connections between the answers, expectations, and contexts of different questions.
GPT can expand on those connections even if no one has ever asked it to “explain statistics with a dolphin metaphor.” It knows what explaining a topic through a metaphor looks like, and it knows what dolphins are.
Anyone who works with GPT regularly will tell you, though, that the further you get from its training material, the more disastrous the results become.
OpenAI’s model is trained with layer upon layer covering things like:
- Holding conversations.
- Avoiding controversial responses.
- Keeping output within the guidelines.
Context and commands can be arranged in an infinite number of ways. Humans can be creative and find endless ways to violate the rules.
What this means is that OpenAI can train an LLM by exposing it to layer after layer of reasoning-like examples, so it can recognize and mimic those patterns.
Not understanding the answers but memorizing them.
OpenAI’s models can also be enhanced with other, non-GPT elements, and those elements come with their own issues. Plugins are OpenAI’s way of solving GPT problems with non-GPT solutions.
Link Reader is a plugin for ChatGPT. It lets users drop links into the chat, and the agent will visit the link to pull in its content. How does GPT accomplish this?
The plugin does not “think” about whether a link should be used or not. It assumes every link is needed, fetches the HTML, and analyzes the text. It is difficult to integrate these plugins in a more elegant way.
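As a hypothetical sketch (this is not the plugin’s actual code), the mechanics look roughly like this: find the URLs in the prompt, fetch them all, and dump the text back into the model’s context, with no judgment about whether any of it was needed.

import re
import requests
from bs4 import BeautifulSoup

def naive_link_reader(prompt: str) -> dict:
    """Fetch every URL in the prompt, whether or not it actually matters."""
    pages = {}
    for url in re.findall(r"https?://\S+", prompt):
        html = requests.get(url, timeout=10).text
        # Strip the markup and keep a chunk of plain text for the model's context.
        pages[url] = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:2000]
    return pages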
The Bing plugin, for example, lets the agent search Bing, but the agent then assumes you want to search far more often than you actually do.
Even with multiple layers of training, GPT is difficult to make consistent, and you run into this immediately if you use the OpenAI API. You can filter for the stock “as an AI language model” response, but other refusals come back with different sentence structures and different ways of saying no.
That makes it difficult to write code that expects consistent input.
What triggers the search function in an OpenAI app?
What if you simply want to discuss search in an article? Chunking inputs can be just as challenging.
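Even something as basic as catching refusals ends up as brittle string matching, because the model words its “no” differently from call to call. A hypothetical sketch:

REFUSAL_MARKERS = (
    "as an ai language model",
    "i'm sorry, but",
    "i am unable to",
)

def looks_like_refusal(response: str) -> bool:
    # Brittle by design: the model can phrase a refusal in ways this list will never cover.
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)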
ChatGPT has difficulty distinguishing between the fantasy and reality of the prompt.
The easiest way to make GPT reason is to incorporate something that is better at reasoning. That is easier said than done.
Ryan Jones posted a great thread on Twitter about this:
Then there are the LLMs themselves.
There is no calculator and no thought process involved: just guessing the next word based on an enormous corpus of text.
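In practice, “incorporate something better at reasoning” means routing around the model for the parts it is bad at. The sketch below is hypothetical (the helper is a stand-in, not a real OpenAI feature): arithmetic gets detected and computed in code instead of being predicted token by token.

import re

def handle(question: str) -> str:
    # Route simple multiplication to real arithmetic instead of next-word guessing.
    match = re.search(r"(\d+)\s*[x×*]\s*(\d+)", question)
    if match:
        return str(int(match.group(1)) * int(match.group(2)))  # computed, not predicted
    return ask_model(question)  # everything else still goes to the LLM

def ask_model(question: str) -> str:
    # Stand-in for an actual ChatCompletion.create call.
    return "model answer goes here"

print(handle("What is 23 x 48?"))  # 1104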
Case 3: GPT and riddles
My favorite type of case? Children’s riddles.
Which word does not belong in each set?
- Green, yellow, red, blue.
- April, December, November, June.
- Cirrus, calculus, cumulus, stratus.
- Carrots, radishes, potatoes, cabbages.
- Fork, comb, rake, shovel.
Think about it for a moment. A child can help you.
These are the real answers:
- Green. Yellow, red and blue are primary colors; green is not.
- December. The other months have only 30 days.
- Calculus. The others are types of clouds.
- Cabbage. Carrots, radishes and potatoes grow underground; cabbage does not.
- Shovel. The others have prongs; a shovel does not.
Let’s now look at some of the responses from GPT.
Interestingly, the answer has the right shape. It landed on “not a primary color” without knowing what primary colors, or even colors, are.
You could call this one-shot querying: I gave the model no additional information and expected it to work things out on its own. But as we saw in the earlier answers, GPT can also get things wrong if you over-prompt.
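In API terms, the difference looks roughly like this (a hypothetical sketch with the 2023-era SDK): the first call gives the model nothing but the riddle to pattern-match against, while the second hands it a worked example; piling on much more than that is where the over-prompting problems from earlier begin.

import openai  # pre-1.0 openai SDK

riddle = "Which word does not belong: cirrus, calculus, cumulus, stratus?"

# Bare query: the model has only the riddle itself to pattern-match against.
bare = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": riddle}],
)

# With one worked example: the model now has a pattern to imitate.
guided = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Which word does not belong: green, yellow, red, blue?"},
        {"role": "assistant", "content": "Green. Yellow, red and blue are primary colors; green is not."},
        {"role": "user", "content": riddle},
    ],
)

print(bare["choices"][0]["message"]["content"])
print(guided["choices"][0]["message"]["content"])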
GPT is not clever. It is impressive but not “general-purpose” enough.
It does not know the context of what it does or says, nor what a particular word means.
GPT is, in essence, math applied to the text of the world. Tokens represent the web as a huge array of points connected to one another by vectors.
The LLM is not as smart as you think
The lawyer who used ChatGPT in court claimed he “thought it was a search engine.”
The high-profile case of professional misconduct is entertaining. But I’m terrified of the consequences.
This information was submitted to the court by a lawyer, a subject-matter expert who is highly-skilled and highly-paid.
People all over the country are doing the same thing, because ChatGPT sounds human and almost feels like a search engine.
Anything and everything can be at stake. And ChatGPT has eaten up the internet, rampant misinformation and all.
We salvage metal from ships that sank before the nuclear era because that steel is not irradiated.
In the same way, text data from before 2022 is a valuable commodity, because it comes from a time when text was meant to be unique, true and human.
This kind of discourse is often based on a misunderstanding of GPT and what it’s used for.
OpenAI is partly responsible for these misunderstandings. It is so eager to be building artificial general intelligence that it has a hard time admitting GPT’s weaknesses.
GPT cannot be a master of anything, because it is being sold as a “master of everything.”
If it cannot use slurs, it cannot moderate content.
If it must always tell the truth, it cannot write fiction.
If it must obey its user, it cannot always be accurate.
GPT is not your friend, it is not a general intelligence, and it is not just a fancy autocorrect.
Its sentences come from statistics applied at massive scale, plus a roll of the dice. And the thing about luck is that sometimes the call comes out wrong.