Google Data Leak

Introduction

Thousands of Google's internal search ranking documents recently leaked, giving us a rare glimpse into how Google ranks sites. This post covers the whole story: what the leaked data contains, how it can help the SEO community, and how Google responded.

What the Leak Was About and Who Leaked It

The leaked documents describe the data Google collects to rank sites. Erfan Azimi, a search marketer and the founder of EA Eagle Digital, was the first to come across the leak. He wanted the search community to know how the ranking system actually works, so on May 5 he reached out to Rand Fishkin, co-founder of Moz, whom he considered the best person to make this information public.

Rand checked with some ex-Googler friends to verify that the leak was authentic, and then turned to Mike King, CEO of iPullRank, to decode the documents. Mike analyzed the documents and later published an article sharing his insights.

When It Was Leaked

The documents were released on GitHub on March 13 by a bot called yoshi-code-bot. They came from Google’s internal Content API Warehouse, which employees use to store their files, and were not taken down until May 7.

What the Leaked Data Is About

There were 2,596 modules in the API files with 14,014 attributes, and the data seemed to be about:

Clicks: According to Mike King, Google most likely uses clicks and post-click behavior for ranking.

Chrome Data: In his blog post, Rand Fishkin says that Google probably uses the number of clicks on pages in the Chrome browser to identify the most popular URLs on a site.

Links: Erfan Azimi mentioned to Rand that Google classifies its link indexes into three tiers: low, medium, and high quality.

Whitelists for certain topics: Rand said that Google uses whitelists for specific topic areas such as travel, Covid, and politics.

Website Authority: Mike concluded that Google has an overall measure of domain authority after seeing a module that mentions a feature called “siteAuthority”.

How It Is Useful for the SEO Community

Some key takeaways from Mike and Rand’s analysis of the documentation are:

User Experience: According to Mike, earning more clicks on a site with a good user experience signals to Google that your page deserves to rank, and this will probably help you recover from the Helpful Content Update.

Authorship matters: Mike also said that Google measures authors. So we can probably assume that authorship is important for ranking, and pay more attention to it.

Links are probably still a big deal: Links from a fresh or top-tier page are more valuable and improve your ranking performance. According to Mike’s analysis, Google checks the average weighted font size of terms in documents and in the anchor text of links. It also values a link based on how much it trusts the homepage.

Page Titles are still important: Mike said that Google still considers how well the page title matches what the user is searching for, since the documents mention a feature called “titlematchScore”.

Video-focused sites are treated differently: If more than 50% of the site has video content, it qualifies as video-focused, and is evaluated differently.

Your Money Your Life is specifically scored: Content that can directly affect people’s well-being, safety and happiness has a separate ranking score.

Locally relevant links might be more valuable: Links from websites in the same country appear to carry more weight than those from other locations.

Quality rating: Rand found mentions of quality raters in the documents, including references to ‘Human Ratings’. So we should probably keep in mind how quality raters view our websites too.

Google’s Response

On May 30, Google confirmed that the leaked documents were authentic, but also said that we shouldn’t assume the data provides a complete picture of the ranking system. Google spokesperson Davis Thompson said, “We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information”. Google also said that its ranking signals are always changing.

Conclusion

SEO experts are still decoding the documents, and we will probably gain even more insights in the months to come. While the data leak cannot be used to get a quick win in SEO, it still contains a lot of information that can confirm our best practices and keep us on the right track, such as the importance of authorship and of earning links to a website with a good user experience.