I borrowed the title of this post from a great article entitled “We need a Wikipedia for data” written by Bret Taylor (emphasis is mine):
I have come to realize how hard it is for an everyday programmer to get access to even the most basic factual data. If you want to experiment with a new driving directions algorithm, it is infinitely more difficult than coming up with an algorithm; you have to hire a lawyer and sign a contract with a company that collects that data in the country you are developing for. […] Even when data is available under a reasonable license, it often suffers from extremely serious quality or discoverability problems. […] I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. […] The interesting thing is, almost every internet company would benefit if this data were freely available. […] To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use. […]
The entire article is worth reading, including the comments, which give interesting links. I just feel the title of the article is somewhat misleading: more than a wiki that contains unstructured data, what Bret explains is that we need free access to structured data (aka databases). Applied to the Shopping / CSE world, the idea of free databases would mean getting easy access to:
Ideally, all of this data should be internationalized, and not English-only. Even limited to the shopping world, that’s really ambitious! But some initiatives already exist. Here are some examples of free data sets that can be used for shopping.
Freebase is “an open database of the world’s information. It’s built by the community and for the community – free for anyone to query, contribute to, build applications on top of, or integrate into their websites”. Despite the huge data sets already available, Freebase is quite thin on shopping-related data. Sure, it has data for cultural goods (Music, Books, Films, Videogames), but for manufactured products, only Cars and Digital Cameras are currently covered. However, it shows the potential of the concept.
The “Digital Cameras” section not only lists cameras, but also contains lists of data related to Digital Cameras:
The fact that Freebase relies on lists to define each product attribute ensures the consistency of product definitions across a category. Indeed, the strength of Freebase is its database model and the corresponding user interface, which are really powerful for managing structured data and ensuring this consistency. Its weakness? Freebase is currently English-centric :( If you want to know more about Freebase, you can read “Freebase: Dispelling the Skepticism” by ReadWriteWeb.
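To make the idea of list-backed attributes concrete, here is a minimal sketch in Python. It is not Freebase's actual schema or API: the attribute names (`sensor_type`, `camera_format`), their allowed values, and the helper `find_inconsistencies` are all hypothetical, chosen only to show how restricting each attribute to a shared list keeps product definitions consistent across a category.

```python
# Hypothetical controlled lists. The attribute names and values below are
# illustrative assumptions, not taken from Freebase itself.
ALLOWED_VALUES = {
    "sensor_type": {"CCD", "CMOS"},
    "camera_format": {"Compact", "DSLR"},
}


def find_inconsistencies(product: dict) -> list:
    """Return one message per attribute whose value is not in its controlled list."""
    problems = []
    for attribute, allowed in ALLOWED_VALUES.items():
        value = product.get(attribute)
        if value not in allowed:
            problems.append(
                f"{attribute}: {value!r} is not one of {sorted(allowed)}"
            )
    return problems
```

A product whose attributes all come from the shared lists yields an empty result, while a free-text variant such as "cmos sensor" is reported. That is the kind of drift a list-based model is meant to prevent.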
DBpedia is a “community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data”. You can find data sets for actors, movies, books, music albums, musicians, videogames… The main difference from Freebase is its support for all the Wikipedia languages. Its weakness? As it’s a university project focused on semantic technologies, expect a less polished user interface than Freebase’s, especially for searching and browsing the entries.
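As a rough illustration of what such a query can look like, the sketch below builds a SPARQL query for films with labels in a given language and turns it into a request URL for DBpedia's public endpoint. The helper names (`build_film_query`, `build_request_url`) are mine, and the use of `dbo:Film`, `rdfs:label`, and the `format` parameter reflects my understanding of the public endpoint rather than anything stated in the sources quoted here, so treat it as a starting point and not a reference.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumed to be available at this address).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_film_query(language: str, limit: int = 10) -> str:
    """Build a SPARQL query listing films with labels in one language."""
    # Keep the sketch safe: only plain alphabetic language codes such as "fr".
    if not language.isalpha():
        raise ValueError("language must be a plain code such as 'en' or 'fr'")
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?label WHERE {{
  ?film a dbo:Film ;
        rdfs:label ?label .
  FILTER (lang(?label) = "{language}")
}}
LIMIT {limit}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"
```

Calling `build_request_url(build_film_query("fr"))` gives a URL that can be fetched with any HTTP client. Changing the language code is all it takes to request labels in another language, which is where DBpedia's multilingual coverage pays off.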
MusicBrainz “is a user-maintained community music metadatabase. Music metadata is information such as the artist name, the release title, and the list of tracks that appear on a release. MusicBrainz collects this information about recordings and makes it available to the public”. Part of the data is in the public domain; the rest is under a Creative Commons license (see license).