Yandex scrapes Google, and other SEO lessons from the source code leak
Yandex’s codebase was leaked online in fragments last week. Like Google, Yandex is a platform with many aspects, including email, maps, and even a taxi service, and the leak contained chunks of code for nearly all of it.
According to the documentation, Yandex’s source code was merged into one repository called Arcadia in 2013. The leaked codebase is a subset of the projects in Arcadia, and we find many components related to the search engine in the “Kernel,” “Library,” “Robot,” “Search,” and “ExtSearch” archives.
This move is completely unprecedented. Not since the 2006 release of the AOL search query data has anything so material related to a web search engine entered the public domain.
Although we are missing the data and many of the files that are referenced, this is the first tangible look at how a modern search engine works at the code level.
Personally, I cannot believe how amazing it is to finally see the code, having just finished my book “The Science of SEO,” in which I discuss information retrieval and how modern search engines work.
Anyway, I have been reading through the code since Thursday, and any engineer will tell you that isn’t enough time to fully understand how everything works. I expect there will be many more posts as I continue to tinker.
Before we get started, I want to thank Ben Wills from Ontolo for sharing the code, pointing me in the right direction, and going back and forth with me as we deciphered the information. Feel free to download the spreadsheet with all of the data about the ranking factors.
Ryan Jones also deserves a shout-out for digging in and sharing key findings over IM.
OK, let’s get busy!
It’s not Google’s code, so why do we care?
Some believe that reviewing this codebase is a distraction and won’t affect their business decisions. That is curious coming from the same SEO community that used the CTR model from the 2006 AOL data for modeling across all search engines for many years.
That said, Yandex isn’t Google. Yet both are web search engines at the forefront of technology and remain state-of-the-art.
Software engineers from both companies attend the same conferences (SIGIR, ECIR, etc.) and share findings and innovations in information retrieval, natural language processing/understanding, and machine learning. Yandex has an office in Palo Alto, while Google used to have a presence in Moscow.
A quick LinkedIn search reveals a few hundred engineers who have worked at both companies, although we don’t know how many of them actually worked on Search at either.
Yandex also makes use of Google’s open-source technologies that have been crucial to innovation in Search, such as TensorFlow, BERT, MapReduce and, to a lesser extent, Protocol Buffers.
So, while Yandex is not Google, it’s also not some random research project. This codebase contains a wealth of information about how a modern search engine is built.
At a minimum, we can disabuse ourselves of some outdated notions that still permeate SEO tools, like text-to-code ratios and W3C compliance.
A little context about Yandex’s architecture
Source code can be difficult to understand if you don’t have the context to compile, run, and step through it.
New engineers are typically onboarded to an existing codebase with documentation, walk-throughs, and pair programming. There is limited documentation about setting up the build process in the docs archive, and Yandex’s code references internal wikis throughout. Those wikis have not leaked, and the comments in the code are also very sparse.
Yandex’s public documentation gives some insight into the architecture, and a few patents Yandex has published in the US help shed further light. Namely:
- “Computer-implemented method and system for searching an inverted index having a plurality of posting lists”
- “Search result ranker”
While researching Google for my book, I developed a deeper understanding of the structure of its ranking systems through various patents, whitepapers, and talks by engineers. I have also spent a lot of time sharpening my grasp of information retrieval best practices for web search engines. It turns out Yandex follows many of those best practices and shares many commonalities.
Yandex’s documentation describes a dual distributed crawler system: one for real-time crawling, called the “Orange Crawler,” and one for general crawling.
Historically, Google’s index has been described as divided into three buckets: one housing the real-time crawl, one for the regularly crawled, and one for the seldom crawled. This segmentation is considered a best practice in IR.
Google and Yandex differ in this regard, but the general idea of segmented crawling driven by an understanding of update frequency holds.
Notably, Yandex does not have a separate rendering system for JavaScript; the documentation states as much. Although Yandex has a Webdriver-based system called Gemini for visual regression testing, its crawling is limited to text.
The documentation also discusses a sharded database structure that splits pages into an inverted index and a document server.
As with most other search engines, the indexing process creates a dictionary, caches pages, and then places data into an inverted index so that bigrams and trigrams, along with their positions in the document, are represented.
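As a rough sketch of what such a structure looks like (a toy model for illustration, not Yandex’s implementation), here is a positional inverted index over bigrams in Python:

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Yield (position, n-gram) pairs for a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield i, " ".join(tokens[i:i + n])

def build_inverted_index(docs, n=2):
    """Map each n-gram to the documents and positions where it occurs.

    docs: dict of doc_id -> text. Returns ngram -> {doc_id: [positions]}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, gram in ngrams(tokens, n):
            index[gram][doc_id].append(pos)
    return index

docs = {
    1: "yandex search engine leak",
    2: "search engine ranking factors",
}
index = build_inverted_index(docs, n=2)
print(dict(index["search engine"]))  # → {1: [1], 2: [0]}
```

Storing positions alongside document IDs is what lets a search engine check term proximity and exact phrases without re-reading the documents.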
This differs from Google, which moved to phrase-based indexing long ago, meaning its n-grams can be considerably longer than trigrams.
Yandex also uses BERT as part of its pipeline, so at some point documents and queries are converted into embeddings, and nearest-neighbor search techniques are employed for ranking.
This is where the fun begins.
Yandex has a layer in which popular search results are cached after the query is processed. From there, the search query is sent to thousands of machines in the Basic Search layer. Each machine builds a posting list of relevant documents and returns it to MatrixNet, Yandex’s neural network application for re-ranking, to build the SERP.
Based on videos in which Google engineers talk about Search’s infrastructure, this ranking process is very similar to Google Search. They speak of Google’s tech being shared environments, where various applications run on each machine and jobs are distributed across those machines depending on the availability of computing power.
This is one of those use cases: distributing a query among a multitude of machines so the relevant index shards can be processed quickly. Computing the posting lists is the first place we need to look for the ranking factors.
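The flow described above is a classic scatter-gather pattern. Here is a toy sketch, with a naive term-overlap scorer standing in for the real relevance function and a plain sort standing in for MatrixNet:

```python
import heapq

def shard_search(shard_docs, query_terms, k=3):
    """Score one shard's documents by naive term overlap; return its top-k posting list."""
    scored = []
    for doc_id, text in shard_docs.items():
        score = len(set(text.lower().split()) & query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def search(shards, query, rerank, k=3):
    """Fan the query out to every shard, merge the posting lists, then re-rank."""
    query_terms = set(query.lower().split())
    candidates = []
    for shard in shards:
        candidates.extend(shard_search(shard, query_terms, k))
    return rerank(candidates)  # stand-in for the MatrixNet re-ranking step

shards = [
    {1: "yandex codebase leak", 2: "google search ranking"},
    {3: "yandex search ranking factors"},
]
top = search(shards, "yandex ranking", rerank=lambda c: sorted(c, reverse=True))
print(top)  # → [(2, 3), (1, 2), (1, 1)]
```

The key design point is that each shard only ever scores its own slice of the index, so the expensive work happens in parallel and only small candidate lists travel over the network.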
The codebase contains 17,854 ranking factors
On the Friday following the leak, the inimitable Martin MacDonald eagerly shared a file from the codebase called web_factors_info/factors_gen.in. The file comes from the “Kernel” archive of the leak and contains 1,922 ranking factors.
The SEO community was quick to run with the information, translating the descriptions and using tools like ChatGPT and Google Sheets to make sense of the data. It’s a great example of how powerful the community can be. However, the 1,922 factors are just one set of ranking factors within the codebase.
Deeper dives into the codebase reveal that there are numerous ranking factor files for different subsets of Yandex’s query processing and ranking systems. Combing through those, we find that there are 17,854 ranking factors in total. Among them are metrics related to:
- Clicks.
- Dwell time.
- Leveraging Metrika, Yandex’s equivalent of Google Analytics.
A series of Jupyter notebooks features 2,000 additional factors beyond those in the core code. Presumably, these notebooks are tests in which engineers consider additional factors to add to the codebase. You can review all of these features and their metadata in the spreadsheet linked above.
Yandex’s documentation clarifies that there are three classes of ranking factors: static, dynamic, and those related specifically to the search performed and how it was performed.
In the codebase’s ranking factor files, these are flagged with the tags TG_STATIC and TG_DYNAMIC. The search-related factors carry multiple tags, such as TG_QUERY_ONLY and TG_QUERY.
While we have surfaced nearly 18,000 potential ranking factors, the MatrixNet documentation indicates that scoring is built from tens of thousands of factors and is customized based on the search query.
This indicates that the ranking environment is highly dynamic, similar to Google’s. According to Google’s “Framework to evaluate scoring functions” patent, Google has long had something similar, where multiple functions are run and the best set of results is returned.
Finally, even though the documentation refers to tens of thousands of ranking factors, keep in mind that many files referenced in the code are missing from the archive, so there is likely more we cannot see. This is illustrated by the onboarding documentation, which includes images showing other directories not present in the archive.
For instance, I believe there may be more information about the DSSM in the /semantic/ directory.
The initial weights of the ranking factors
At first, I assumed the codebase did not contain weights for the ranking factors. Then I was stunned to discover that the nav_linear.h file contains the initial coefficients, or weights, associated with ranking factors.
This section of code highlights 257 of the 17,000+ ranking factors we have identified. (Special thanks to Ryan Jones for pulling these and lining them up with the factor descriptions.)
To make this clearer: think of a search engine algorithm as a big mathematical equation that scores pages based on a series of weighted factors. Although simplified, the screenshot below shows an example of such an equation. The coefficients indicate how important each factor is, and the resulting computed score is what is used to rank pages for relevance.
These values being hard-coded suggests this is not the only place where ranking happens. Rather, this function is most likely where the initial relevance scoring is done to generate a series of posting lists for each shard that will be considered for ranking. In the first patent listed above, this is described as query-independent relevance (QIR), which limits documents before reviewing them for query-specific relevance (QSR).
The resulting posting lists are then handed off to MatrixNet along with query features for comparison. So while we don’t yet know the specifics of the downstream operations, these weights are still valuable because they indicate the criteria a page must meet to be eligible for the consideration set.
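Stripped to its essence, the initial scoring step described above is a weighted sum of factor values. A toy sketch (the first two weights echo coefficients quoted elsewhere in this article; the FI_ADV weight and all the factor values are made-up placeholders):

```python
def initial_relevance(factor_values, weights):
    """Query-independent relevance: a weighted sum of factor values.

    Both arguments map factor name -> float; missing factors count as 0.
    """
    return sum(weights[name] * factor_values.get(name, 0.0) for name in weights)

# Hypothetical weights and page values, loosely shaped like the leaked coefficients.
weights = {"FI_PAGE_RANK": 0.1828678331, "FI_IS_COM": 0.2762504972, "FI_ADV": -0.25}
page = {"FI_PAGE_RANK": 0.5, "FI_IS_COM": 1.0, "FI_ADV": 1.0}
score = initial_relevance(page, weights)
print(round(score, 4))  # → 0.1177
```

A hard-coded linear combination like this is cheap enough to run over every candidate document in a shard, which is exactly what you want for a first-pass filter before an expensive neural re-ranker.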
This brings us to the next question: what do we know about MatrixNet?
The Kernel archive contains neural ranking code, and there are numerous references to MatrixNet and “mxnet,” as well as to Deep Structured Semantic Models (DSSM), throughout the codebase.
The description of the FI_MATRIXNET ranking factor indicates that MatrixNet is applied to all of the factors:
Factor {
Index: 160
CppName: “FI_MATRIXNET”
Name: “MatrixNet”
Tags: [TG_DOC, TG_DYNAMIC, TG_TRANS, TG_NOT_01, TG_REARR_USE, TG_L3_MODEL_VALUE, TG_FRESHNESS_FROZEN_POOL]
Description: “MatrixNet can be applied to all factors – it is the formula.”
}
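Since entries in factors_gen.in follow this simple block format, pulling fields out programmatically is straightforward. Here is a rough parser sketch (it assumes the simplified structure shown above; real entries may have more fields and variations):

```python
import re

def parse_factor_block(text):
    """Extract simple Key: Value pairs and the Tags list from one Factor block."""
    factor = {}
    for key, value in re.findall(r'(\w+):\s*"?([^"\n\]]+)"?', text):
        if key != "Tags":  # Tags is a bracketed list, handled separately below
            factor[key] = value.strip()
    tags = re.search(r"Tags:\s*\[([^\]]*)\]", text)
    if tags:
        factor["Tags"] = [t.strip() for t in tags.group(1).split(",")]
    return factor

block = '''Factor {
    Index:   160
    CppName: "FI_MATRIXNET"
    Name:    "MatrixNet"
    Tags:    [TG_DOC, TG_DYNAMIC]
}'''
factor = parse_factor_block(block)
print(factor["CppName"], factor["Tags"])  # → FI_MATRIXNET ['TG_DOC', 'TG_DYNAMIC']
```

This is essentially what the community spreadsheets did at scale: run every block through a parser like this and dump the fields into rows.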
There are also several binary files that could be pre-trained models, but I will need to spend more time decoding those parts of the code.
What is clear is that there are multiple levels of ranking (L1, L2, and L3) and an assortment of ranking models that can be selected at each level.
The selecting_rankings_model.cpp file suggests that different ranking models may be considered at each layer throughout the process. Each level refines an aspect of the operation, and the combined computations produce the re-ranked list of documents that eventually appears as a SERP. I will do a deeper dive into MatrixNet when I have more time; in the meantime, the “Search result ranker” patent mentioned above is worth a look.
Let’s now take a look at some intriguing ranking factors.
Top 5 negatively weighted factors in the initial ranking
Below is a list of the most negatively weighted initial ranking factors, with their weights and brief explanations based on translations of the Russian descriptions.
- FI_ADV: This factor determines whether there is advertising of any kind on the page, and it issues the heaviest weighted penalty of any single ranking factor.
- FI_DATER_AGE: -0.2774373667 – This factor is the difference between the current date and the document date determined by a dater function. The value is 1 if the document date is the same as today, and 0 if the document is 10 years or older or if the date is undefined. This suggests Yandex gives preference to older content.
- FI_QURL_STAT_POWER: This factor is the number of URL impressions related to the query. Yandex seems to want to demote URLs that appear across many searches in order to promote diversity of results.
- FI_COMM_LINKS_SEO_HOSTS: -0.1809636391 – This factor is the percentage of inbound links with “commercial” anchor text. The factor reverts to 0.1 if the percentage of such links exceeds 50%; otherwise, it is set to 0.
- FI_GEO_CITY_URL_REGION_COUNTRY: -0.168645758 – This factor is the geographical coincidence of the document and the country the user searched from. The value is 1 when they match, which makes the negative weight puzzling.
In summary, these factors suggest that, to achieve the best possible initial score, you should:
- Avoid ads.
- Instead of creating new pages, update older content.
- Have more links with branded anchor text than commercial anchor text.
Everything else on this list is effectively out of your control.
Top 5 positively weighted initial ranking factors
Here is a list of the most heavily positively weighted initial ranking factors.
- FI_URL_DOMAIN_FRACTION: +0.5640952971 – This factor is a strange masking overlap between the query and the domain of the URL. The example given is “Chelyabinsk lottery,” abbreviated as “chelloto.” To compute the value, Yandex finds the three-letter combinations that are covered (che, hel, and lot) and calculates the fraction of all the query’s three-letter combinations that appear in the domain name.
- FI_QUERY_DOWNER_CLICKS_COMBO: +0.3690780393 – The description says this factor is “cleverly combined of FRC and pseudo-CTR.” There is no immediate indication of what FRC is.
- FI_MAX_WORD_HOST_CLICKS: +0.3451158835 – This factor is the clickability of the most important word in the domain. For example, for all queries containing “wikipedia,” this would capture the clicks on Wikipedia pages.
- FI_MAX_WORD_HOST_YABAR: +0.3154394573 – The factor description says this is the “most characteristic query word corresponding to the site, according to the bar.” I’m assuming this means the keyword most searched for in the Yandex Toolbar associated with the site.
- FI_IS_COM: +0.2762504972 – The domain is a .com.
In other words, in the initial ranking, Yandex smiles upon sites that:
- Play word games with their domain.
- Use a dot-com.
- Encourage people to search for their target keywords in the Yandex bar.
- Keep driving clicks.
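To make the FI_URL_DOMAIN_FRACTION idea above concrete, here is a speculative sketch of the computation. The chunking rules are my assumption; the leaked description only gives the “chelloto” example:

```python
def domain_fraction(query, domain, n=3):
    """Share of the query's overlapping n-letter chunks found in the domain.

    A speculative reconstruction of FI_URL_DOMAIN_FRACTION; the exact
    chunking rules in the leaked code are not documented.
    """
    letters = query.replace(" ", "").lower()
    chunks = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    if not chunks:
        return 0.0
    host = domain.lower()
    hits = sum(1 for chunk in chunks if chunk in host)
    return hits / len(chunks)

# "che", "hel" and "lot" appear in "chelloto"; 3 of 16 trigrams match.
print(domain_fraction("chelyabinsk lottery", "chelloto.ru"))  # → 0.1875
```

Under this reading, a domain like "chelloto" gets partial credit for covering fragments of the query even though the full query string never appears in it.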
A series of unexpected initial ranking factors
What’s even more fascinating about the initial weighted ranking factors are the unexpected ones. Here is a list of seventeen factors that stood out.
- FI_PAGE_RANK: +0.1828678331 – PageRank is the 17th most heavily weighted factor in Yandex. Since Yandex previously removed links from its ranking altogether, it’s not surprising that PageRank sits this low in the system.
- FI_SPAM_KARMA: +0.00842682963 – “Spam karma” measures the likelihood that the host is spam, based on Whois information.
- FI_SUBQUERY_THEME_MATCH_A: +0.1786465163 – How closely the query and the document match thematically. This is the 19th most heavily weighted factor.
- FI_REG_HOST_RANK: +0.1567124399 – Yandex has a host (or domain) ranking factor.
- FI_URL_LINK_PERCENT: +0.08940421124 – The ratio of links whose anchor text is a URL (rather than text) to the total number of links.
- FI_PAGE_RANK_UKR: +0.08712279101 – Yandex has a specific Ukrainian PageRank.
- FI_IS_NOT_RU: +0.08128946612 – It’s a good thing if the domain is not a .ru. Apparently, the Russian search engine doesn’t trust Russian sites.
- FI_YABAR_HOST_AVG_TIME2: +0.07417219313 – The average dwell time as reported by YandexBar.
- FI_LERF_LR_LOG_RELEV: +0.06059448504 – Link relevance based on the quality of each link.
- FI_NUM_SLASHES: +0.05057609417 – The number of slashes in the URL is a ranking factor.
- FI_ADV_PRONOUNS_PORTION: -0.001250755075 – The proportion of pronouns on the page.
- FI_TEXT_HEAD_SYN: 0.01291908335 – The presence of [query] words in the header, with synonyms taken into account.
- FI_PERCENT_FREQ_WORDS: -0.02021022114 – The percentage of the document’s words that are among the 200 most frequent words of the language.
- FI_YANDEX_ADV: -0.0926121965 – Yandex penalizes pages with Yandex’s own ads on them.
- FI_AURA_DOC_LOG_SHARED: -0.09768630485 – The logarithm of the number of shingles (areas of text) in the document that are not unique.
- FI_AURA_DOC_LOG_AUTHOR: -0.09727752961 – The logarithm of the number of shingles for which this document’s owner is recognized as the author.
- FI_CLASSIF_IS_SHOP: -0.1339319854 – Yandex gives you less love if your page is a shop.
The main takeaway from reviewing these odd ranking factors, and the array of factors across the Yandex codebase, is that there are many things that could plausibly be a ranking factor.
Google’s reported “200 signals” are likely 200 classes of signals, where each signal is a composite built of many other components, much like Google Analytics has numerous dimensions composed of multiple metrics.
Yandex scrapes Google, Bing, YouTube, and other services for information
The codebase makes it clear that Yandex has parsers for other websites and their respective services. To Westerners, the most notable are the ones listed in the heading above. Beyond those, Yandex has parsers for a wide range of services, including its own.
What is immediately evident is that the parsers are feature-complete: every meaningful element of the Google SERP is extracted. In fact, anyone considering scraping any of these services might do well to review this code.
Other code indicates Yandex may be using some Google data in its DSSM calculations, but the 83 Google-named ranking factors alone make it obvious that Yandex leans on Google’s results heavily.
Obviously, Google would never pull the Bing move of copying other search engines’ results, nor depend on one for core ranking calculations.
Yandex has anti-SEO upper bounds on some ranking factors
There are 315 ranking factors with thresholds; any value beyond the threshold indicates to the system that the page’s given feature is over-optimized. 39 of these factors are part of the initially weighted set and may keep a page out of the initial posting list. You can find them in the spreadsheet linked above by filtering on the Rank Coefficient and the Anti-SEO columns.
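As an illustration of how such an upper bound might behave (my own sketch of the concept; the leaked code may zero out or penalize offending values rather than cap them, and the factor names below are made up):

```python
def apply_antiseo_bounds(factor_values, thresholds):
    """Cap factor values at their anti-SEO thresholds.

    Hypothetical behavior: values past the threshold simply earn no extra credit.
    """
    bounded = {}
    for name, value in factor_values.items():
        limit = thresholds.get(name)
        bounded[name] = min(value, limit) if limit is not None else value
    return bounded

page = {"keyword_density": 0.42, "exact_anchor_ratio": 0.30}  # made-up factor names
limits = {"keyword_density": 0.10}
bounded_page = apply_antiseo_bounds(page, limits)
print(bounded_page)  # → {'keyword_density': 0.1, 'exact_anchor_ratio': 0.3}
```

The effect either way is the same from an SEO perspective: pushing a factor past its threshold stops helping and may start hurting.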
Conceptually, it’s not unreasonable to assume that modern search engines set thresholds for certain factors that SEOs have historically abused, such as keyword stuffing, CTR, and anchor text. Bing, for example, was reported to have treated the abusive use of meta keywords as a negative factor.
Yandex boosts “Vital Hosts”
Yandex has a variety of boosting mechanisms throughout its codebase. These are artificial enhancements to certain documents to ensure they score higher when being considered for ranking.
One comment from the “boosting wizard” suggests that smaller files benefit most from the boosting algorithm.
There are several types of boosts. I have seen one related to links, and I also saw a series of “HandJobBoosts,” which I can only assume is a strange translation of “manual” modifications.
One boost I found especially interesting relates to “Vital Hosts.” A vital host can be any site specified, and a NEWS_AGENCY_RATING is specifically mentioned among the files, which leads me to believe that Yandex biases its results toward certain news organizations.
Without getting into geopolitics, this is very different from Google, which has been adamant about not introducing biases like this into its ranking systems.
The structure of the document server
The codebase reveals how documents are stored in Yandex’s document server. Rather than simply saving a copy of a page to its cache, Yandex captures various metadata that it can then use in downstream ranking.
The screenshot below highlights some of the most interesting features. Another file with SQL queries suggests that the document server may have upwards of 200 columns, including the DOM tree, sentence lengths, fetch time, a series of dates, antispam scores, the redirect chain, and whether the document has been translated. The most complete list I’ve come across lives in /robot/rthub/yql/protos/web_page_item.proto.
The most fascinating thing about this subset is the number of simhashes employed. Simhashes are numerical representations of content that search engines can use for lightning-fast comparison to identify duplicate content, and there are many instances in the robot archive explicitly indicating that duplicate content is removed.
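For the curious, here is a minimal simhash in the classic Charikar style, which is the general family these fingerprints belong to (not Yandex’s exact variant):

```python
import hashlib

def simhash(text, bits=64):
    """Charikar-style simhash over whitespace tokens."""
    vector = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash per token (md5 used here only for determinism).
        digest = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if digest >> i & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("completely unrelated text about search engines")
print(hamming(a, b), hamming(a, c))  # near-duplicates land much closer together
```

Because near-duplicate documents produce fingerprints within a small Hamming distance of each other, a crawler can dedupe billions of pages by comparing 64-bit integers instead of full texts.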
The codebase also features both TF-IDF and BM25 as part of its indexing process. It’s not clear why all of these mechanisms exist in the code, since there is some redundancy in using them all.
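For reference, BM25 is the standard probabilistic scoring function built on term frequencies and document lengths. A compact implementation of the textbook formula (not code from the leak; the corpus below is fabricated):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one document against a query with BM25.

    corpus: list of token lists; doc: one token list from it.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        # Term frequency saturates via k1; b normalizes for document length.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

corpus = [
    "yandex search engine code".split(),
    "google search quality".split(),
    "cooking recipes blog".split(),
]
score = bm25_score(["search", "engine"], corpus[0], corpus)
print(round(score, 3))  # → 1.331
```

The redundancy noted above makes some sense historically: TF-IDF is the older weighting scheme and BM25 its refinement, so a long-lived codebase could easily carry both.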
Prioritization and link factors
Yandex’s handling of link factors is especially interesting because it had previously disabled their impact altogether. The codebase also reveals a lot about link factors and how links are prioritized.
Yandex’s link-spam calculator weighs 89 factors; anything marked SF_RESERVED is disregarded. These factors are described in the Google Sheet linked above.
Yandex, for example, has a host rank, as well as scores that seem to linger long after a page or site develops a reputation for spam.
Yandex also reviews copy across domains to determine whether duplicate content is being used with links, be it sitewide links, links on duplicate pages, or links with the same anchor text coming from the same site.
This shows how trivial it is to discount multiple links from the same source, and how important it is to target unique links from unique sources.
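That kind of link collapsing reduces to counting each (source domain, anchor text) pair only once. A toy sketch of my own (not the leaked logic):

```python
from urllib.parse import urlparse

def unique_link_signals(links):
    """Collapse inbound links to one signal per (source domain, anchor text) pair."""
    seen = set()
    for source_url, anchor in links:
        domain = urlparse(source_url).netloc
        seen.add((domain, anchor.strip().lower()))
    return len(seen)

links = [
    ("https://example.com/a", "best widgets"),
    ("https://example.com/b", "best widgets"),    # sitewide duplicate: ignored
    ("https://example.com/c", "widget reviews"),  # new anchor: counted
    ("https://other.org/post", "best widgets"),   # new domain: counted
]
print(unique_link_signals(links))  # → 3
```

Under a model like this, a thousand sitewide footer links from one domain contribute no more than a single link would.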
What can we apply from Yandex to what we know about Google?
This is the question still on everyone’s mind. While there are many similarities between Yandex and Google, truthfully, only a Google software engineer who works on Search could answer this question definitively.
But that’s the wrong question.
Really, this code should help expand our understanding of modern search. Much of our collective knowledge of search is cobbled together from what the SEO community has learned through testing and from what we heard from search engineers back in the early 2000s, when search was much more opaque. Unfortunately, that has not kept pace with the rapid rate of innovation.
The Yandex leak’s many features and factors should yield more hypotheses about what to test and consider for ranking in Google. They should also introduce more things that can be analyzed and measured by SEO crawling, link analysis, and ranking tools.
For example, a measure of cosine similarity between queries and documents using BERT embeddings could be valuable to compute against competitor pages, since it is something modern search engines themselves do.
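That comparison boils down to cosine similarity between embedding vectors. A self-contained sketch with stand-in vectors (in practice you would generate embeddings with a BERT-style model; the four-dimensional numbers here are fabricated for illustration, while real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query_vec = [0.1, 0.3, 0.5, 0.1]     # stand-in embedding of a query
my_page = [0.1, 0.25, 0.55, 0.1]     # stand-in embedding of your page
competitor = [0.5, 0.1, 0.1, 0.3]    # stand-in embedding of a competitor page

print(round(cosine_similarity(query_vec, my_page), 3))     # → 0.994
print(round(cosine_similarity(query_vec, competitor), 3))  # → 0.444
```

The absolute values matter less than the comparison: a page whose embedding sits closer to the query embedding than a competitor's does is, by this measure, more topically aligned.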
Just as the AOL search logs moved us away from guessing the distribution of clicks on a SERP, the Yandex codebase moves us away from the abstract toward the concrete, and our “it depends” statements can be better qualified.
To that end, this codebase is a gift that will keep on giving. It has only been a weekend, and it has already yielded some very compelling insights.
I expect that some ambitious SEO engineers will keep digging, filling in the gaps, and perhaps even compiling this thing and getting it working. I also suspect engineers at the various search engines are poring through it for innovations to add to their own systems.
Simultaneously, Google’s lawyers may be drafting cease-and-desist letters related to all the scraping.
I am eager to see how our space evolves.
However, if you find that the actual code doesn’t provide valuable insights, you can always go back to arguing about subdomains versus subdirectories.
The post Yandex Scrapes Google and Other SEO Learnings From the Source Code Leak was first published on Search Engine Land.