Google can apparently find anything online. It almost seems like magic. How is this possible? There are really just four steps involved.
Publishing. Anyone can post stuff on the Web. Some Web pages are published by companies, to advertise products or services or post news; some are by individuals who blog (short for Web log, a sort of public online diary). All told, there are about 1 trillion pages out there.
Crawling. Suppose you ask your 100 best friends for a list of their favorite Web pages. Say one of these is the front page of Yahoo.com. Now, for each page, you look at the first link on that page and follow it. For example, the first link on Yahoo.com might take you to, say, Microsoft.com. So you go there. Then you follow the first link on Microsoft.com. Once you get to a page with no other links, you back up and start following the links you haven’t inspected yet. This process is called crawling, and if you start with a large enough set of pages and spend enough time following links, eventually you’ll visit nearly every page that can be reached by following links. As you can imagine, this takes a long time. Google has tens of thousands of machines that do nothing but crawl, 24 hours a day, 365 days a year. They store a copy of every page crawled. This requires about 1 quadrillion bytes (1,000,000,000,000,000), but with some clever tricks it can be “compressed” to a mere 250 trillion bytes or so, the equivalent of about 1,000 consumer hard drives.
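The link-following procedure described above can be sketched in a few lines of Python. This is only a toy illustration, not how Google's crawler is actually built: the "Web" here is a small hand-written dictionary (TOY_WEB, with made-up .example page names) instead of real pages fetched over the network, and the function name crawl is an assumption chosen for this example.

```python
def crawl(seed_pages, links):
    """Depth-first crawl over a toy link graph.

    links maps each page name to the list of pages it links to.
    Returns the pages in the order they were first visited.
    """
    visited = []
    seen = set()
    # Pages still to inspect; reversed so the first seed is popped first.
    stack = list(reversed(seed_pages))
    while stack:
        page = stack.pop()
        if page in seen:
            continue  # already crawled via some other link
        seen.add(page)
        visited.append(page)
        # Push links in reverse so the FIRST link on the page is followed first;
        # the remaining links wait on the stack until we "back up" to them.
        for target in reversed(links.get(page, [])):
            if target not in seen:
                stack.append(target)
    return visited


# Hypothetical miniature Web: page name -> pages it links to.
TOY_WEB = {
    "yahoo.example": ["microsoft.example", "news.example"],
    "microsoft.example": ["downloads.example"],
    "downloads.example": [],
    "news.example": ["yahoo.example"],
    "island.example": [],  # nothing links here
}

order = crawl(["yahoo.example"], TOY_WEB)
print(order)
```

Note that island.example never shows up in the result: no page links to it and it was not in the starting list, which is why a crawler needs a large and varied set of starting pages.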
Indexing. Imagine a “dictionary” containing every word that appears on every page in the World Wide Web. The dictionary “entry” for each word isn’t the word’s meaning, but rather a list of all Web pages on which that word appears. This dictionary is called an index, and lets you ask the question “Show me all the Web pages where this word appears.” What about phrases consisting of more than one word, like “Merry Christmas”? Simple: besides storing what pages contain a given word, you also remember what position in the page the word occurred at. For example, if Merry is the 75th word on some page and Christmas is the 76th word, you have found the phrase “Merry Christmas”; but if Christmas is the 100th word, you haven’t. So, searching for a term (or phrase) corresponds to doing one or more lookups in the dictionary. To find the phrase “I love you” means to find every page in which the words “I”, “love” and “you” occur in consecutive positions.
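Here is a minimal sketch of such an index in Python, including the word positions needed for phrase searches. The function names (build_index, find_phrase) and the three sample pages are invented for illustration; a real search engine's index is far more elaborate, and this sketch simply splits text on spaces.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {page name: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][name].append(position)
    return index


def find_phrase(index, phrase):
    """Return sorted names of pages where the phrase's words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    matches = []
    # Only pages containing the first word can possibly contain the phrase.
    for name, starts in index[words[0]].items():
        for start in starts:
            # Word number k of the phrase must sit at position start + k.
            if all(start + offset in index[w].get(name, [])
                   for offset, w in enumerate(words)):
                matches.append(name)
                break
    return sorted(matches)


# Hypothetical sample pages.
PAGES = {
    "card.example": "we wish you a merry christmas",
    "mixed.example": "christmas was merry this year",
    "note.example": "i love you and i love merry songs",
}

index = build_index(PAGES)
print(find_phrase(index, "merry christmas"))
print(find_phrase(index, "i love you"))
```

The page mixed.example contains both "merry" and "christmas" but not next to each other in that order, so it does not match the phrase, just as in the 75th-word and 100th-word example.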
Searching. Given how large this “dictionary” is (millions of pages, for Google), how do searches happen so fast? Imagine a 1-million-page dictionary. You recruit 1 million friends and give each of them 1 page of the dictionary. Now, when it’s time to do a search, all 1 million people simultaneously consult their particular page. Anyone who finds a match raises a hand. If the dictionary gets larger, you just get more people. This is called parallel processing, and Google has tens of thousands of machines doing Web searches all day in just this manner. This particular task is said to be embarrassingly parallel because each of your 1 million friends can do the job without interacting with anyone else, so in principle you could speed things up even more by just adding more people. (A key challenge in computing is that most large tasks aren’t embarrassingly parallel.)
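The "million friends" idea can be sketched with Python's standard concurrent.futures module. In this toy version each "friend" is a thread holding one small slice (shard) of the dictionary; in a real system the shards would live on separate machines. The shard contents and the function names are assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, word):
    """One 'friend' checks only their own slice of the dictionary."""
    return shard.get(word, [])


def parallel_search(shards, word):
    """Ask every shard at the same time, then merge whatever they found."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        # Each lookup is independent of the others: no shard needs to talk
        # to any other shard, which is what makes the task embarrassingly parallel.
        results = pool.map(lambda shard: search_shard(shard, word), shards)
        hits = [page for shard_hits in results for page in shard_hits]
    return sorted(hits)


# Hypothetical dictionary split into three slices, each mapping word -> pages.
SHARDS = [
    {"apple": ["fruit.example"], "banana": ["fruit.example"]},
    {"merry": ["card.example", "mixed.example"]},
    {"zebra": ["zoo.example"]},
]

print(parallel_search(SHARDS, "merry"))
print(parallel_search(SHARDS, "unicorn"))
```

If the dictionary grows, you add more shards and more workers; the search code itself does not change.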
Common misconceptions. As you see, Web searching is purely mechanical. There is no human involvement in the process. Google has developed very sophisticated methods for ranking (deciding which search results are most relevant to what you probably wanted, and showing those at the top of the results list), but the ranking is based entirely on the usage patterns of other people. (When you click on a link resulting from a Google search, that tells Google which of the search results you found most relevant. That information is used to improve future rankings, so every search you perform actually improves the system!) In the same way, Google News is “mechanical” in that the “editorial decisions” of what to show under Top Stories are based on how many people have visited or searched for those stories, not on some human being’s opinion of what makes them Top Stories.
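To make the click-feedback idea concrete, here is a deliberately simplistic Python sketch that orders results purely by how often people clicked them for a given query. It is not Google's ranking method, which is far more sophisticated; the class name ClickRanker, its methods, and the sample pages are all hypothetical.

```python
from collections import Counter


class ClickRanker:
    """Toy ranker: results clicked more often for a query are shown first."""

    def __init__(self):
        # (query, page) -> number of times that page was clicked for that query
        self.clicks = Counter()

    def record_click(self, query, page):
        self.clicks[(query, page)] += 1

    def rank(self, query, pages):
        # Most-clicked first; sorted() is stable, so ties keep their input order.
        return sorted(pages, key=lambda page: -self.clicks[(query, page)])


ranker = ClickRanker()
results = ["a.example", "b.example", "c.example"]

# Simulated usage: two people clicked b.example, one clicked c.example.
ranker.record_click("merry christmas", "b.example")
ranker.record_click("merry christmas", "b.example")
ranker.record_click("merry christmas", "c.example")

print(ranker.rank("merry christmas", results))
```

No person decides the order here; it falls out of the recorded clicks, which is the sense in which the ranking is "mechanical".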
Having said that, it is “mechanical” at a scale that is unprecedented in the history of computing. Between Google Search, Google Earth, Google Maps and the other Google services, Google’s computers process enough data each day to fill more than 50,000 consumer hard drives. Hundreds of drives fail each day from wear and tear, but Google’s software maintains multiple copies of every item, so losing a drive doesn’t mean losing the information. All of this is performed in dozens of datacenters worldwide: specially designed buildings, each housing between 50,000 and 250,000 computers that run 24 hours a day, 365 days a year. Other large companies, including Amazon, eBay, Yahoo and Microsoft, also maintain datacenters of their own.