Next in my series commenting on the New Scientist magazine series “Eight things you didn’t know about the internet” is part 3, “How big is the net?”, by Colin Barras.

That the internet is vast is undoubted. In July 2008, web surfers were introduced to Cuil.com, billed by its designers as “the world’s biggest search engine”. It indexed an impressive 120 billion pages, but shortly before its launch Google announced that its systems had registered a trillion unique pages (see Internet census 2007 and 2008).

Even this might represent a fraction of what is out there. Some estimates suggest that there could be hundreds of times more information stored on the internet than Google or Cuil have so far indexed.

In the early 1990s, in the early days of the World Wide Web, the NCSA had a web page that listed all the new web sites that had recently gone up (first weekly, then daily), and you could actually check out everything that was new on the Internet that week, or that day. Indexing and searching the relatively small number of web sites was a fairly easy task at the time, and publishing a complete list was reasonable. If you missed checking their list for a few days, or a week, you could go back and catch up.

That soon changed, of course. It is now essentially impossible to index everything, so complex web crawlers compete to find the most useful subset. The web monitoring service Netcraft reckons that as of its April 2009 survey there were more than 230 million web sites, with 6 million having been added in the preceding month, an average of some 200,000 per day. You’d have to check more than 2 sites every second, 24 hours a day, in order to look at all of the new ones now.
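For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes a 30-day month and takes the 6 million figure at face value; the function names are just illustrative. The result comes out between 2 and 3 new sites per second.

```python
# Rough check of the "new sites per day / per second" arithmetic.
# Assumptions: a 30-day month, and 6 million new sites in that month.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sites_per_day(new_sites_per_month: float, days_in_month: int = 30) -> float:
    """Average number of new sites appearing each day."""
    return new_sites_per_month / days_in_month


def sites_per_second(new_sites_per_month: float, days_in_month: int = 30) -> float:
    """Average number of new sites appearing each second, around the clock."""
    return sites_per_day(new_sites_per_month, days_in_month) / SECONDS_PER_DAY


if __name__ == "__main__":
    monthly = 6_000_000
    print(f"{sites_per_day(monthly):,.0f} new sites per day")
    print(f"{sites_per_second(monthly):.2f} new sites per second")
```

Changing the month length to 31 days, or plugging in a different monthly count, shifts the answer only slightly; either way it is far beyond what a person could browse.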

Even defining what a “web site” is has become more complicated than it used to be, and Netcraft has its own definition and mechanism for deciding. I think we’d all agree that each blog hosted at blogspot.com is its own “site”. What about each different Facebook page? Each LinkedIn profile? How many distinct “sites” are there at ibm.com? We probably wouldn’t consider what you get if you click on Services to be a distinct site, but the alphaWorks pages might be, and the IBM Research pages probably are. How do we decide?

There’s also more to how “big” the Internet is than the number of web sites, bytes of stored data, or number of Internet users. How much hardware is out there? How much electricity does it use? How many bits are shifted around from place to place? Should we look at a connectivity map? How much money is spent on Internet infrastructure? It seems rather like asking how big outer space is.

Maybe a more useful question is “How important is the Internet?” To that, I say that I can’t imagine doing any sort of business or research today without it.