The core data comes from CommonCrawl, a non-profit group that crawls the web and provides the data for anyone to use. Gil Elbaz is a founder of both CommonCrawl and Factual, a start-up that creates tables of structured information from data found on the open web (see Factual: Parting The Curtains Of The Invisible Web).
Factual found stats such as those I cited above after examining 4 million web sites. In particular:
* 28% of sites have Google Analytics on them
* 12% of sites have AdSense
* 5% of sites have EITHER a Twitter or Facebook link but…
* 2% of sites have BOTH a Twitter and Facebook link
There’s also a chart that shows other interesting stats but without precise percentages. I’ll estimate as best I can:
* About 20% of sites have Flash
* About 19% of sites have an RSS feed
* About 6% of sites have a sitemaps file
* About 1% of sites have a Google Webmaster Central verification code
* About 1% of sites have Quantcast tracking code
* About 0.5% of sites have a Creative Commons attribution
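
For anyone curious how percentages like these could be tallied from crawled pages, here is a minimal sketch. It is not Factual's method, which isn't described here. The `SIGNATURES` patterns, `detect_features` and `tally` are all hypothetical names and rough assumptions of my own, and I'm reading "either" as "at least one of the two."

```python
import re

# Hypothetical signatures (assumptions, not Factual's actual rules).
# Simple pattern checks like these can miss or over-count real usage.
SIGNATURES = {
    "google_analytics": re.compile(r"google-analytics\.com", re.I),
    "adsense": re.compile(r"googlesyndication\.com", re.I),
    "twitter_link": re.compile(r"href=[\"'][^\"']*twitter\.com", re.I),
    "facebook_link": re.compile(r"href=[\"'][^\"']*facebook\.com", re.I),
}


def detect_features(html):
    """Return the set of signature names found in one site's HTML."""
    return {name for name, pattern in SIGNATURES.items() if pattern.search(html)}


def tally(pages):
    """Given one HTML string per site, return the percentage of sites per feature."""
    counts = {name: 0 for name in SIGNATURES}
    counts["twitter_or_facebook"] = 0   # at least one of the two links
    counts["twitter_and_facebook"] = 0  # both links present
    total = 0
    for html in pages:
        total += 1
        found = detect_features(html)
        for name in found:
            counts[name] += 1
        social = found & {"twitter_link", "facebook_link"}
        if social:
            counts["twitter_or_facebook"] += 1
        if len(social) == 2:
            counts["twitter_and_facebook"] += 1
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Tiny made-up sample, only to show the shape of the output.
    sample = [
        '<script src="https://www.google-analytics.com/ga.js"></script>',
        '<a href="https://twitter.com/x">t</a><a href="https://facebook.com/x">f</a>',
        '<p>nothing here</p>',
    ]
    for name, pct in sorted(tally(sample).items()):
        print(f"{name}: {pct:.1f}%")
```

Note that the "both" count is always a subset of the "either" count, which is why the 2% figure above can sit inside the 5% figure.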