The amazing librarian
This article is a runner up in the general public category of the Plus new writers award 2008
"The universe, which others call the Library, is composed of an indefinite and perhaps infinite number of hexagonal galleries..."
Thus begins The Library of Babel, a short story by the Argentinean writer Jorge Luis Borges.
Later on, he recalls that "Like all men of the Library, I have travelled in my youth; I have wandered in search of a book, perhaps the catalogue of catalogues." And he adds, "There are five shelves for each of the hexagon's walls; each shelf contains thirtyfive books of uniform format; each book is of four hundred and ten pages; each page, of forty lines, each line, of some eighty letters ..."
Sergey Brin, one of the two founders of Google. Image courtesy James Duncan Davidson/O'Reilly Media, Inc. Reproduced under Creative Commons Attribution 2.0.
Perhaps Borges was anticipating in his portentous library the vastness of the World Wide Web, and longing for the perfect librarian who could resolve the despair of a search. Alas, Borges died in 1986, three years before Tim Berners-Lee, working at the European Organization for Nuclear Research, launched the idea of the Web as an information conglomerate. Another nine years had to pass until in 1998 two graduate students at Stanford University, Sergei Brin and Larry Page, published The Anatomy of a Large-Scale Hypertextual Web Search Engine, where they introduced Google and revolutionised the design of Web search engines.
As Brin and Page tell in their article, google is a common spelling for googol, the name of the impossibly large number 10100. There is no doubt that Google, the search engine, and Google, the company, have both lived up to the expectations of their name. A single number can give us the proof:
This was the estimated market value of the company, in dollars, when it went public in 2004, and it is the approximate number of documents ranked by Google.
For the sake of brevity, we will concentrate on how Google decides the relative importance of a page, ignoring all other aspects. But we must not forget that before doing any ranking, a Web search engine has to identify the public pages and it has to index them, so key words or phrases can be found easily.
The importance of being connected
Google judges the importance of a page on the basis of the importance of the pages linked to it. In this simple principle resides much of Google's success. Rather than using word of mouth, expert advice or other human interventions, Google lets the Web do the ranking, using the Web's own linking structure.
The alert reader might have noticed the troubling fact that our stated criterion did not really define what importance is. It merely referred the importance of one page to the importance of other pages. So, we ask now, how does Google quantify this notion of importance?
Although the full story of what Google does is not quite known, here is the gist of it: if the page has a total of, say, 10 links to other pages, among them the page , then will transfer one tenth of its importance to .
Now suppose that the pages that link to page are , , etc, up to page These pages are called the back links of . Write , , etc, for the importance of the pages , , etc. In the same way, write , , etc, for the total number of links on page , , etc. Then we can work out the importance of page as
To make more apparent how this formula works for a whole web, first label all the pages in the web by , , , etc. Now build an array of numbers, in which the entry corresponding to the row and column is
has the array
Now let's pretend for a moment that we know the importance of each page and collect this information in a vector
This is independent of the nature and size of the web, and of the actual values in the vector it always works because of the way we have defined them in general terms.
And here comes the crucial bit: linear algebra also provides the tools to solve the matrix equation
The property that is mathematically described by saying that the importance vector is an eigenvector of the link matrix, with eigenvalue 1. (In general, a vector is an eigenvector of with eigenvalue , if multiplied by produces the vector with each entry multiplied by .)
Going back to our first mini web, the eigenvector we are looking for turns out to be
We should point out here that the equation remains true if we multiply each entry of the vector by some number . This means that there are infinitely many eigenvectors of eigenvalue 1. It is customary to choose the only one whose entries add up to 1, as is the case in our example here. The zero vector always satisfies the above equation, but it will not give us any information about the relative importance of the pages. Therefore we always look for a non-zero eigenvector.
Much remains to be said and done if instead of our mini web we are dealing with the real Web. Finding the eigenvectors of a matrix with 25,000,000,000 rows and columns is a tremendous task. There is not much hope for finding these eigenvectors in a reasonable time, and instead one aims at capturing good enough approximations. Other mathematical techniques are brought into this effort. We will not dwell in the details, because more challenges lie ahead.
You see, so far we have taken for granted that if we put enough mathematics to work, we will be able to find exactly one solution to our problem. But this might not be true, regardless of the size of the web. Disconnected webs, such as
are bad news. The link matrix
The disconnection of this web is reflected in the block structure of the link matrix. It has two sub-matrices
In spite of this problem, Google still manages to come up with a sensible solution. It does it by creating a new matrix that combines the original link matrix and the matrix for which all the entries are equal to 1/5 (because there are 5 pages in our web). The new matrix is calculated as
Google does not publicise its recent choices for , but it is known that in the past it has used the value , which gives us the combined matrix
Another problem that might arise is that the link matrix has no non-zero eigenvectors at all. This can occur when there are dangling pages, which do not have any back links. But thankfully, linear algebra provides the techniques for dealing with these cases too. In the real Web, one would have to keep track of both, dangling pages and disconnected pages.
All the steps we have described, together with its modifications and more, constitute the so-called PageRank algorithm. We can think of it as a mathematical evaluation of the Web. Google runs its PageRank algorithm monthly, counteracting in this way the ever changing nature of the Web.
I do not want to leave you with the idea that Google is the only path to a rewarding Web search. Other successful search engines do exist, but Google still enjoys a lion's share of the market, inspiring millions of fans to search the Web with confidence. One of these fans, the cartoonist Gerry Trudeau, describes Google as "the Swiss Army knife of information retrieval".
About the author
Josefina (Lolina) Alvarez was born in Spain. She earned a doctorate in mathematics from the University of Buenos Aires in Argentina, and is currently a professor of mathematics at New Mexico State University in the United States. Her research interests lie in the areas of functional and harmonic analysis. She is on the editorial board of Matematicalia and writes the column What if...?. She likes to hike, and with husband Larry and dogs Brandywine and Sofia, she has walked many kilometres along the beautiful trails of New Mexico and elsewhere. You can read more about her on her website.