How to remove sensitive client data from Google’s index
Better keyword rankings. Increased traffic. Conversions from organic search. These are the KPIs used to measure SEO results.
Yet some consultants and agencies manage SEO campaigns for clients without considering a crucial element:
Keeping confidential client content out of Google’s search results.
Neglecting this can lead to a breach of trust, or even costly litigation, and ultimately the end of the client relationship.
It can be avoided if you understand how easily client data finds its way into Google’s search index, and how to prevent it.
Below is an indexing issue many SEOs overlook: the accidental exposure of client information on Google, and how to deindex that content.
What I did when I found sensitive information
I am a full-time, independent SEO consultant. Since 2018, I have worked with various midsize companies to improve organic search results.
As part of a technical SEO audit, I use Google’s site search operator, entering site:domain.com to quickly see how URLs, titles, and snippets appear across different page categories.
I look for patterns in what gets indexed, sometimes adding keywords to the operator to narrow in on something specific.
For most clients, I’ll notice dev/testing/staging sites getting indexed. I also find thin content that dilutes link equity, hurts search efficacy or leads to keyword cannibalization, and paid landing pages that were never meant to rank.
With SaaS clients, though, I’ve noticed a more alarming pattern:
Pages getting indexed under subdomains that no one in marketing or product ever thought about.
The most innocent are customer subdomains that let clients customize their login experience (e.g., client.example.com).
Even so, a customer may not want their name appearing in search results. What is meant to differentiate your product can also expose your customers.
More serious cases involve web-based forms that collect data from specific individuals.
Without password protection, anyone finding them could access, and even modify, the form fields.
These findings aren’t directly related to organic search, but I’m quick to flag them. There is a lot at stake.
In several cases, this became an “all hands on deck” problem, and I was asked to remove the data from search results as quickly as possible.
One CEO remarked that his security consultants had never brought up this possibility, yet it was found quickly through a simple step many SEOs perform in an audit.
To be fair, it takes some fairly specific searching to find these types of pages.
But consider the strange searches your clients, or their customers, might run. Never forget that 15% of Google’s daily search queries are brand new!
Even when it isn’t a legal problem, sensitive information in search results, especially if clients find it first, can still damage your relationship.
Why is this data on Google?
It only takes a single, inconspicuous link from anywhere on the internet for a search engine to discover a page:
- Does the page appear in your XML Sitemap even if you don’t link to it on your website?
- Was the page referenced somewhere in your website’s JavaScript at some point?
- Most often, someone linked to the page intending it only for certain people, such as survey participants, not the general public.
Awareness is half the battle. Once you have identified the pages that need to be removed, you can start getting them out of Google search immediately.
How to deindex Google content quickly
Search for patterns in the URLs Google is displaying that contain sensitive data.
It’s not uncommon for the web-based version of a SaaS product to live on a subdomain such as data.example.com. Use the site search operator (site:data.example.com) to scan through the results.
You can also view all URLs Google has indexed using the Page Indexing report in Google Search Console.
That may not surface everything, so it can help to ask your product team, who can often point you to more.
Double-check the URLs
Consider every URL variation that could canonicalize to what you see in search results.
If only the canonical version is removed, the alternate versions of the URL may get indexed in its place.
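As a sketch of what “all variations” means, the common variants of a single URL (scheme, www prefix, trailing slash) can be enumerated with the Python standard library. The helper name is made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

def url_variations(url: str) -> set[str]:
    """Enumerate common variants of a URL (scheme, www prefix,
    trailing slash) that may canonicalize to the same indexed page."""
    scheme, netloc, path, query, frag = urlsplit(url)
    hosts = {netloc}
    if netloc.startswith("www."):
        hosts.add(netloc[4:])
    else:
        hosts.add("www." + netloc)
    paths = {path}
    if path.endswith("/") and path != "/":
        paths.add(path.rstrip("/"))
    elif path and not path.endswith("/"):
        paths.add(path + "/")
    return {
        urlunsplit((s, h, p, query, frag))
        for s in ("http", "https")
        for h in hosts
        for p in paths
    }

# Eight variants of one hypothetical page URL.
for u in sorted(url_variations("https://data.example.com/report")):
    print(u)
```

Running each variant through the site search operator or the URL Inspection tool confirms whether any of them are indexed separately.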
Then create a new request in the GSC Removals tool: either list the URLs individually, or enter a URL prefix (the second radio button under “New Request”), such as a likely subdomain.
For a small number of pages, the URL Inspection tool can speed up removal and confirm each URL’s current index status, though you must process them one by one. Microsoft Bing’s Block URLs tool works much the same way; Bing is smaller than Google, but don’t overlook it.
Keep in mind that removals made this way only last about six months.
They also won’t stop the problem from recurring, or address other search engines, so there is one last step to complete.
How to permanently remove content from Google
There are two methods that can be used.
1. Use the noindex meta robots tag on those pages.
Have your web developer add it to the relevant page templates.
- For PDFs, images, and other non-HTML content, add an X-Robots-Tag HTTP header with a noindex (or none) value. This works for regular HTML pages too, but it’s usually less convenient than the meta tag.
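To spot-check that the tags are actually in place, you can inspect a response’s headers and body. A minimal sketch in Python (the function name is hypothetical, and the meta-tag regex assumes the common attribute order):

```python
import re

def is_noindexed(headers: dict[str, str], body: str = "") -> bool:
    """Return True if a response signals noindex, either via an
    X-Robots-Tag header or a meta robots tag in the HTML body."""
    # HTTP header check (works for PDFs, images, any content type).
    header = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    if re.search(r"\b(noindex|none)\b", header, re.I):
        return True
    # Meta robots check (HTML pages only; assumes name= before content=).
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        body, re.I,
    )
    return bool(meta and re.search(r"\b(noindex|none)\b", meta.group(1), re.I))

print(is_noindexed({"X-Robots-Tag": "noindex, nofollow"}))           # True
print(is_noindexed({}, '<meta name="robots" content="noindex">'))    # True
print(is_noindexed({"Content-Type": "text/html"}, "<html></html>"))  # False
```

Feed it the headers and body you fetch for each URL on your removal list to confirm the directive is live before requesting reindexing.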
Note: Don’t use robots.txt rules that disallow crawling (images are an exception), as they don’t solve this problem. A disallow blocks crawling, not indexing; a crawl-blocked URL can still be indexed from external links, and a crawler that never fetches the page will never see its noindex tag.
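The crawling-versus-indexing distinction is easy to demonstrate with Python’s standard-library robots.txt parser, using a hypothetical rule set:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt disallowing a sensitive folder.
robots_txt = [
    "User-agent: *",
    "Disallow: /reports/",
]

parser = RobotFileParser()
parser.parse(robots_txt)

# The disallow blocks crawling of the folder...
print(parser.can_fetch("Googlebot", "https://data.example.com/reports/123"))  # False
# ...but it says nothing about indexing: if external links point at the URL,
# Google can still index it, and because the page is never fetched,
# any noindex tag you place on it goes unseen.
```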
2. Gate the content
Password-protecting files or webpages ensures that only authorized users can access them. It’s another reliable way to keep your content from appearing on Google.
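What gating looks like depends on your stack, but the core check is small. As an illustrative sketch of validating an HTTP Basic Auth header in Python (the credentials and function name are invented for the example):

```python
import base64
import hmac

USERNAME, PASSWORD = "client", "s3cret"  # illustrative credentials only

def is_authorized(authorization_header: str) -> bool:
    """Validate an HTTP 'Authorization: Basic ...' header value.
    Uses hmac.compare_digest to avoid timing side channels."""
    if not authorization_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(authorization_header[6:]).decode("utf-8")
        user, _, password = decoded.partition(":")
    except Exception:
        return False
    return hmac.compare_digest(user, USERNAME) and hmac.compare_digest(
        password, PASSWORD
    )

token = base64.b64encode(b"client:s3cret").decode()
print(is_authorized("Basic " + token))                              # True
print(is_authorized("Basic " + base64.b64encode(b"x:y").decode()))  # False
```

In practice you would configure this at the web server or application framework level rather than hand-rolling it; the point is that unauthenticated requests, including Googlebot’s, get turned away before any content is served.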
Keeping sensitive information out of search results
After taking these steps, you can be confident that pages with sensitive client data will not reappear in Google’s search results. Removals usually take effect within a day.
Always tell your client what happened, though. Nothing on the web ever completely disappears.
The article How do I remove sensitive client information from Google’s search engine first appeared on Search Engine Land.