PageRank: Deconstructing the Ideal

The backbone of Google’s link-based algorithm follows a simplistic and highly debated model known as PageRank. Larry Page first constructed the process when he realized that both the number and nature of inbound links pointing to a single page from across the web were a meaningful basis for making assumptions about the page’s semantics and its relevancy for a given topic or keyword.

The original model basically compared links to ‘votes’ – that is, the more votes a page (not necessarily a website) had in its favor across the web, the higher it would appear in the search results. A voting page would also pass a bit of its own incoming votes to whatever it voted for. That is, if Website A votes for Website B, which in turn votes for Website C – then Website C would attain some value from the original vote from Website A.

Here’s a diagram from Matt Cutts on the original flow of PageRank

It’s a decently simple process. Today we take it for granted, but at the time this was a breakthrough in search algorithms – as popular engines such as Lycos and Excite didn’t exploit link-based metrics.
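
To make the voting idea concrete, here is a minimal Python sketch of the classic model as it appears in the published PageRank papers, not of whatever Google runs today. The function name classic_pagerank, the toy three-page graph, the 0.85 damping value, and the even spreading of rank from pages with no outlinks are all assumptions made for illustration.

```python
def classic_pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with an even split

    for _ in range(iterations):
        # Every page gets a small baseline share on each pass.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page's vote is divided evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Website A votes for B, which in turn votes for C.
ranks = classic_pagerank({"A": ["B"], "B": ["C"], "C": []})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy chain, C ends up with the highest score because it inherits part of A's vote through B, which is exactly the pass-along behavior described above.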

The PageRank algorithm has changed a lot over the years, and what started off as a simple & divisive model has turned into something that has a lot of supplementary processing and contextual relevance applied to it. According to Matt Cutts:

“Even when I joined the company in 2000, Google was doing more sophisticated link computation than you would observe from the classic PageRank papers. If you believe that Google stopped innovating in link analysis, that’s a flawed assumption.”

As well as...

"Google doesn’t view pages exactly in the framework as “classic PageRank” any more"

Many different theories have been put forward about how PageRank currently operates. The idea of PageRank sculpting took hold of the search community a couple of years ago, and interest has recently been rekindled as a result of Google’s ‘new’ position on the rel=nofollow attribute.

Let’s be clear about something though:

No one knows how to channel/sculpt/map PageRank. Outside of Google, anyone claiming such things is just plain full of it.

Why? Because no one in the search community really knows how PageRank flows. Anyone can make guesses, but outside of the original model (a model that has long been outdated) we really have no confirmed information about how it is processed. The one exception is the rel=”nofollow” attribute, which is kind of a moot point by now.

So what can we really say about PageRank? Well, here are a few points that have been confirmed recently as a result of all the nofollow uproar:

1) PageRank still possesses some divisive properties

The original PageRank model is still a player in the game, but certainly not the only player on the team anymore. PageRank still flows divisibly into, around, and out of websites. That is, the more links on a page, the less PageRank will be passed to each of them. And vice versa. Matt Cutts recently replied in a comment thread to Danny Sullivan:

"For example, from your proposed paragraph I would strike the “The number of links doesn’t matter” sentence because most of the time the number of links do matter, and I’d prefer that people know that."

But make no mistake, Google itself solely decides how much PageRank will flow to each and every link on a website, and a link can often be ignored or devalued for reasons I’ll mention later.
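
As a rough illustration of the divisive property, the sketch below splits a page's PageRank evenly across its links. The function share_per_link, the even split, and the 15% decay are assumptions borrowed from the classic model; as noted above, Google alone decides what each link actually receives.

```python
def share_per_link(page_rank, outlink_count, decay=0.15):
    """Return the PageRank each outlink would receive under an even split."""
    if outlink_count <= 0:
        return 0.0  # nothing to pass along if the page has no links
    # Remove the decayed portion first, then divide what is left.
    return page_rank * (1.0 - decay) / outlink_count


# The same page passes less to each link as the number of links grows.
for count in (5, 10, 50):
    print(count, "links ->", round(share_per_link(1.0, count), 4), "per link")
```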

2) PageRank is not computed at intervals

You wouldn’t necessarily think I’d have to mention this, but after talking with quite a few SEO-savvy individuals, I’ve found that many people still think the toolbar update is a, well, ‘master dump’ of PageRank processing within Google. According to Matt Cutts:

“By the time you see newer PageRanks in the toolbar, those values have already been incorporated in how we score/rank our search results.”

And…

“I believe that I’ve said before that PageRank is computed continuously; there are machines that take inputs to the PageRank algorithm at Google and compute the resulting PageRanks.”

3) PageRank has a diminishing value in the range of 10-15% per ‘hop’

PageRank can’t be infinite for a number of reasons, the biggest of which is the ‘loop problem.’ Suppose Website A links to Website B, which links to Website C, which links back to Website A:

"No PageRank would ever escape from the loop, and as incoming PageRank continued to flow into the loop, eventually the PageRank in that loop would reach infinity. Infinite PageRank isn’t that helpful so Larry and Sergey introduced a decay factor–you could think of it as 10-15% of the PageRank on any given page disappearing before the PageRank flows along the outlinks." –Matt Cutts, 2009

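The loop problem is easy to reproduce in a toy simulation. The sketch below is a hypothetical setup, not Google's computation: three pages link in a circle, one unit of PageRank arrives at page A from outside on every step, and the function name simulate_loop and the 15% decay value are assumptions chosen to match the quote.

```python
def simulate_loop(steps, decay, inflow=1.0):
    """Return the total PageRank held inside an A -> B -> C -> A loop."""
    next_page = {"A": "B", "B": "C", "C": "A"}
    loop = {page: 0.0 for page in next_page}

    for _ in range(steps):
        updated = {page: 0.0 for page in loop}
        for page, value in loop.items():
            # A fraction `decay` disappears before the rest flows onward.
            updated[next_page[page]] += value * (1.0 - decay)
        updated["A"] += inflow  # outside links keep feeding the loop
        loop = updated
    return sum(loop.values())


print("no decay, 100 steps: ", simulate_loop(100, decay=0.0))
print("no decay, 1000 steps:", simulate_loop(1000, decay=0.0))
print("15% decay, 1000 steps:", round(simulate_loop(1000, decay=0.15), 3))
```

With no decay the total keeps growing with every step. With a 15% decay it levels off at inflow / decay, roughly 6.667 in this setup, no matter how long the simulation runs.
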
You’d be surprised how many SEO ‘experts’ claim the opposite in some circumstances. There’s been a lot of talk lately about ‘trust’ and the TrustRank algorithm, especially from Rand Fishkin. The assumption is that a page (or a domain) has a certain level of innate ‘trust’ with Google, based on factors that have been more or less, well, completely unexplained by the TrustRank proponents.

I think an important consideration that is often ignored when talking about TrustRank is this:

TrustRank comes from a research paper written at Stanford by a Yahoo! intern, and it has never been publicly employed by any major search engine. It’s theory with no evidence of application. It’s neat theory that deals with web semantics and internet mapping, but still theory. And I think this whole ‘neat’ factor is why so many SEO professionals claim that Google is using it now.

I was genuinely concerned to hear some of this Trust Propaganda from Jim Boykin recently, who basically stated that ‘Trust’ shoots outward from a site, never diminishes (which defies the diminishing law of PageRank), and is originally passed from a seed set of sites outward into the web. Under this view, if NASA.gov receives a Trust Score of 10, every page it links to would receive a Trust Score of 9, and so on, and so forth.

The real problem with this theory, aside from the fact that the infrastructure never existed in the first place, is this: what if 2 seed sites linked to each other? Wouldn’t their ‘inter-beam’, and their resulting outbound ‘beams,’ follow a continuous expansion into infinity?

Yes they would. And this is in complete contradiction of Google’s view on the flow of web authority. Just like Google has said, “you only get so much PageRank for your site.”

Also discouraging to the Trust theory is the fact that Google's primary objective is still, to this day, to return relevant search results. Because relevancy and authority are not synonymous, Google would be foolish to base its search results on authority alone.

A more likely candidate for a supplementary PR model would be the HillTop algorithm, for several reasons.

HillTop aligns well with Google’s stance on relevant link patterns:

"The popular search engines are known to ignore links obtained from irrelevant pages, even if the PR were high". –Matt Cutts

“It is possible to rank well for competitive keywords without high pagerank and there are many examples of it.” –Matt Cutts

"PageRank and our search result rankings qualifies as an opinion and not simply some rote computation." –Matt Cutts

Now here we have Matt saying that PageRank is not always the direct cause of competitive rankings. What we can assume, then, is that there are supplementary algorithms in addition to PageRank that kick on and off in accordance with link data. One of these supplementary algorithms is likely to be HillTop (and I say this is much likelier than TrustRank).

The ‘Florida Update’ that occurred in 2003 had a catastrophic effect on many websites whose incoming backlinks came from websites with no relevancy. Lots of optimized sites with highly diverse & authoritative link profiles disappeared overnight. According to Aaron Wall, before Florida a website could rank well and easily if it had many incoming backlinks with targeted anchor text, but after the Florida update this was no longer the case.

Relevance is the keyword here, not authority (which TrustRank relies on), as the previous quotes from Matt suggest. And that’s the key difference between TrustRank and HillTop: where TrustRank deals with authority, HillTop aligns itself with context.

Krishna Bharat developed the HillTop algorithm in 1999-2000 and patented a ‘certain’ algorithm in partnership with Google in 2001. Not surprisingly, Bharat now works for Google.

The process, simply stated, is that there are ‘expert documents’ aligned to certain keywords. There are essentially 3 steps involved:

  • Google constructs a pre-defined list of expert documents either organically or by computation (given Matt’s previous quote about ‘rote computation’, my bet would be on organic compilation at this point).

    What are expert documents? They are basically highly relevant websites related to a keyword; think institutions, accredited organizations, or anyone else who provides real editorial value to a chosen topic. Searching for ‘news’ within Google would likely yield a good expert document in position #1.

  • A search query is executed, and a sub-list of expert documents is created and correlated based on the keyword string.
  • A “LocalScore” is now given to a page that has votes (links) from the expert documents that were just compiled.

And if a site doesn’t have a vote (link) from at least 2 expert documents, it does not receive a HillTop LocalScore. This doesn’t mean that it won’t appear high in the results (remember we still have PageRank computing), but its position will likely be affected by a lack of LocalScoring.
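
The three steps can be sketched in Python as follows. This is a simplified illustration of the process as described here, not HillTop's actual scoring: the EXPERT_DOCS data, the local_scores function, the keyword matching, and the use of a plain vote count as the score are all hypothetical.

```python
from collections import Counter

# Step 1 (hypothetical data): a pre-defined list of expert documents,
# each with the topics it covers and the sites it links to.
EXPERT_DOCS = {
    "news-expert-1.example": {"topics": {"news"},
                              "links": ["site-a.example", "site-b.example"]},
    "news-expert-2.example": {"topics": {"news", "politics"},
                              "links": ["site-a.example"]},
    "garden-expert.example": {"topics": {"gardening"},
                              "links": ["site-b.example"]},
}


def local_scores(query, expert_docs, min_experts=2):
    """Return a simplified LocalScore for each site with enough expert votes."""
    terms = set(query.lower().split())

    # Step 2: build the sub-list of experts relevant to this query.
    relevant = [doc for doc in expert_docs.values() if doc["topics"] & terms]

    # Step 3: count one vote per relevant expert that links to a site.
    votes = Counter()
    for doc in relevant:
        for target in set(doc["links"]):
            votes[target] += 1

    # Sites with fewer than `min_experts` votes receive no LocalScore at all.
    return {site: count for site, count in votes.items() if count >= min_experts}


print(local_scores("news", EXPERT_DOCS))
```

For the query "news", only site-a.example is linked from two relevant experts, so it is the only site that receives a score. site-b.example has one relevant vote and is left out, even though a gardening expert also links to it.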

So by now we’ve concluded that there is a finite amount of PageRank that diminishes as it flows, that PageRank likely has supplemental processes within it (or is in and of itself a supplemental process by this point), and that PageRank still contains some of its original divisive elements.

With this information, going forward, SEOs need to be very mindful of their inbound link neighborhoods (as Michael Martinez so famously says – "manage your link investment risk") and of how many associative properties the linking websites have with their own websites.

One very common mistake among linkbuilders is evaluating a website by its Toolbar PR, outbound link number, or domain type (.edu, .gov). We can conclude from Matt Cutts that assuming a cause and effect relationship between high PR backlinks and search rankings is a very, very flawed assumption that many SEOs make.

As far as on-site strategy goes, internal linking needs to be natural and well thought out (there are actually mathematical formulas you can apply to flat architecture and link ratios, but these won’t be discussed here), and it should account for the number of hops involved from each page, since PR has diminishing qualities. Be especially mindful of this if you’ve got a lot of deep-rooted backlinks, as a news site or blog often does.
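
One way to get a feel for hop distance is a breadth-first walk over your internal links, starting from a page that attracts backlinks. The sketch below is illustrative only: the site map is hypothetical, and retained_after assumes a flat 15% decay per hop while ignoring how PageRank is divided among the links on each page.

```python
from collections import deque


def hops_from(start, links):
    """Return the minimum number of hops from `start` to each reachable page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:  # first visit is the shortest path
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def retained_after(hops, decay=0.15):
    """Rough fraction left after `hops` hops under a flat per-hop decay."""
    return (1.0 - decay) ** hops


# Hypothetical internal links: a deep blog post that has earned backlinks.
site_links = {
    "/blog/post": ["/blog"],
    "/blog": ["/"],
    "/": ["/products"],
}

for page, hops in hops_from("/blog/post", site_links).items():
    print(page, hops, "hops,", round(retained_after(hops), 3), "retained")
```

In this toy site the products page sits three hops away from the post that earns the backlinks, so under these assumptions only a bit over 60% of that value would reach it even before any division among links.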

A good analysis of this would be to construct (either through a tool, or manually) a spreadsheet of every page on your website. Look at 3 metrics for each page:

  • Number and nature of backlinks
  • Number of internal links
  • Number of outbound links from the page

This is not a drop-it-in and turn-it-on type of strategy for every page, but it will give you an idea of how effectively you’re leveraging the different PageRank hubs on your website. It will also tell you how you might direct that power, through interlinks and/or anchor text, to a different block or section of the website. Because you can be sure that outbound links aren't the only links evaluated on a contextual level.
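
If you already have a crawl or backlink export, the counts for that spreadsheet can be tallied with a short script. The sketch below assumes the data is a list of (source URL, target URL) pairs; the page_metrics function, the column names, and the example URLs are hypothetical, and the "nature" of each backlink still has to be judged by hand.

```python
from collections import defaultdict
from urllib.parse import urlparse


def page_metrics(links, site_host):
    """Tally backlinks, internal links, and outgoing links for each page."""
    metrics = defaultdict(
        lambda: {"backlinks": 0, "internal_links": 0, "outbound_links": 0}
    )
    for source, target in links:
        source_is_ours = urlparse(source).netloc == site_host
        target_is_ours = urlparse(target).netloc == site_host
        if target_is_ours:
            # Links from other sites count as backlinks, the rest as internal.
            key = "internal_links" if source_is_ours else "backlinks"
            metrics[target][key] += 1
        if source_is_ours:
            # Every link leaving one of our pages, internal or external.
            metrics[source]["outbound_links"] += 1
    return dict(metrics)


crawl = [
    ("https://other.example/review", "https://mysite.example/a"),
    ("https://mysite.example/", "https://mysite.example/a"),
    ("https://mysite.example/a", "https://mysite.example/b"),
    ("https://mysite.example/a", "https://external.example/"),
]

print("page,backlinks,internal_links,outbound_links")
for page, row in sorted(page_metrics(crawl, "mysite.example").items()):
    print(page, row["backlinks"], row["internal_links"], row["outbound_links"], sep=",")
```

The printed rows can be pasted straight into a spreadsheet, where pages with many backlinks but few links pointing onward stand out quickly.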

Finally, keep it free flowing. Letting PageRank flow freely within your site, as it was designed to do, is what Google and Matt Cutts advocate. Responsible SEO clearly gives us ways to emphasize certain blocks more than others (I always ask myself: does it “make sense” from a usability perspective?), but trying to sculpt PageRank is a battle that you’ve already lost.