Thursday, 4. October 2007

Now queryable as open linked data: U.S. Census/Congress datasets, 1 billion triples

and it's fast!

Since one can't blog about this enough, I am copying the story from this announcement email:

(the following text is by Josh Tauberer)

Hi, everyone. (This is a revised/combined reannouncement for what was
originally posted on the Linking Open Data list.)

Last November, Chris Bizer wrote, "[T]he DBLP server increases the size
of the Semantic Web by around 10 percent ;-)" [1] Based on the same
logic, I have recently increased the size of the semantic web by 200%!
(in terms of the number of triples; and of course I'm also just joking
here w.r.t. size of the semantic web)

I'm announcing here a new U.S. 2000 Census dataset of 1 billion triples,
accessible over SPARQL and browsable by linked data [2] principles, and
re-announcing my U.S. Congress dataset which is newly browsable with
linked data principles. These two datasets are interconnected, and the
Census dataset is linked up via owl:sameAs to Geonames [3].

I like the Census data set a lot for three reasons: first, if you live
in the U.S. it has something for you, since it has detailed statistics
on geographic entities down to the level of small towns and villages, and
everyone lives somewhere; second, it meshes with two other data sets;
and third, it's rich enough on its own to support a wide array of
interesting and real-world useful queries (if, say, you were doing

The OpenLink guys were kind enough to host the data set previously, but
I wanted to push the limits of my own semweb C# library [4], and I wanted
to be able to revise the data set as needed, so I wanted to host it
myself, which I was only recently able to do (even though I've had the
triples lying around for nearly a year).

A complete description of the data set and how it was constructed and
exposed is here:

Some features of the data set:

Data on 3,200 U.S. counties, 36,000 "towns", 16,000 "villages", 33,000
ZCTAs (something like zip-codes), and 435 congressional districts.

Each of those locations has around 10,000 population statistics
attached, as well as a dc:title, a basic hierarchical structure
between regions, and latitude/longitude.

Very basic geographic/name/lat-lng data (1 million triples) can be
downloaded in N3.

All of the 1 billion triples are accessible via SPARQL. See: which has a few sample
queries. An example query is "List the states in the United States that
have more students in dorms than prisoners."
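A query along those lines might look like the following sketch. The namespace and predicate names here are invented placeholders for illustration, not the dataset's actual vocabulary:

```sparql
# Hypothetical sketch: the prefix and predicate names below are
# placeholders, not the real Census vocabulary.
PREFIX census: <http://example.org/census/>

SELECT ?state ?dorm ?prison
WHERE {
  ?state a census:State ;
         census:inCollegeDormitories      ?dorm ;
         census:inCorrectionalInstitutions ?prison .
  FILTER (?dorm > ?prison)
}
```

The pattern is typical for this kind of statistical query: bind two numeric properties of the same resource, then compare them in a FILTER.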

The URIs for the geographic regions are dereferenceable http: URIs. (The
URIs for the predicates in the data set will be updated to be
dereferenceable in the future.) For example, you can visit the URI for
New York State:

(Some URIs return very large pages that take Firefox quite a while to
render. That one's OK.)

The dereferenceable URIs return 303s to SPARQL DESCRIBE pages describing
those URIs.
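In other words, an HTTP GET on a region URI answers 303 See Other, redirecting to a document that is essentially the result of a query like this (the URI below is a placeholder, not a real region URI from the dataset):

```sparql
# Placeholder URI for illustration; the real region URIs live
# under the dataset's own domain.
DESCRIBE <http://example.org/geo/new-york-state>
```

This is the standard linked-data pattern for distinguishing the non-information resource (the state itself) from the document that describes it.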

There is a sitemap.xml file based on the latest draft circulated [5],
referenced from robots.txt:

And the source code to generate the triples from the Census download
files is posted. The full RDF is too large for me to provide myself,
for now at least.

The U.S. Congress data set, which I originally made SPARQL-accessible in
December 2005 but have now revised to follow the new linked data
principles, has 12 million triples containing brief biographical data
for all members of Congress, plus data for federal legislation and
voting records going back a number of years. Here are two example
dereferenceable URIs:
(= Senator John McCain)
(= a bill in Congress)

Some example Congress-related queries are posted here:
And dump files are here:

An example I like to use is that one could fairly easily create a table
using SPARQL aligning votes on a particular bill by congressmen with,
for instance, the median commuting time to work of their constituents,
as reported by the Census.
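As a sketch, such a mash-up query could join the two datasets through the congressional-district resources. All prefixes, predicates, and the bill URI below are invented placeholders for the real vocabularies:

```sparql
# All names below are hypothetical placeholders, not the
# actual Congress/Census vocabularies.
PREFIX politico: <http://example.org/congress/>
PREFIX census:   <http://example.org/census/>

SELECT ?member ?vote ?commute
WHERE {
  ?member politico:represents ?district ;
          politico:castVote [ politico:onBill   <http://example.org/bill/hr1234> ;
                              politico:position ?vote ] .
  ?district census:medianCommutingTimeToWork ?commute .
}
ORDER BY ?commute
```

The interesting part is the join variable ?district, which is what the interlinking of the two datasets makes possible in the first place.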

Thanks to those who gave feedback on the LOD list --- I haven't been
able to address all of it yet (like how to deal with backlinks on the
dereferenced pages).


- Josh Tauberer

FOAF vs. Walled Communities

Web 2.0 is getting richer behind proprietary walls.
A question that comes up often: why has FOAF been out there for years without anyone building a business model on it? Easy: when you open your service up to be replaceable, you are replaceable. Imagine a world where you could switch your community website the way you can already switch your newsreader today (thanks to OPML): venture capitalists would sweat like a drunken grad student in a final exam.

This post is partly inspired by this. There is a mix-up of cause and effect: FOAF as a standard would make Web 2.0 services interoperable and standardized. Orkut, Facebook, LinkedIn, StudiVZ, and the rest would all have the same API and an extensible data format if they used FOAF and RDF.

But wait: who is the venture capitalist behind these Web 2.0 walls? Maybe you. Then you suddenly realize that when the little nerds in your computer room switch the lever towards standards, your money may go down the drain, because your precious closed community of people (and that is what you sell and own: data about people) will be open for anyone else to copy. So you would do your best not to go for standards, but instead to make the BEST social service EVER, so that everyone DIGGS it and invites all HIS FRIENDS inside the closed walls. Capitalism is fine, but we have to call it what it is.

As Dick Hardt from SXIP said in his well-known keynote: it's your data, not theirs. FOAF and RDF are a way to get your data back from the Web 2.0 companies that own you at the moment, to make them give back the data you have entered. Switching from Flickr to another photo community should be as easy as switching your newsreader (thanks to OPML) or office application (thanks to OpenDocument).
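To make that concrete: the core of what a social site holds about you fits in a few lines of FOAF, which any other service could import. A minimal sketch, with an invented person and invented URIs (only the foaf: namespace is real):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# "Alice" and her URIs are made up for illustration.
<http://example.org/people/alice#me> a foaf:Person ;
    foaf:name  "Alice Example" ;
    foaf:mbox  <mailto:alice@example.org> ;
    foaf:knows <http://example.org/people/bob#me> .
```

A community site exporting profiles in this form would let its users walk out with their social graph, which is exactly the point of the argument above.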

Microsoft does not go crazy for a standardized OpenDocument format, which would help free us from the monopoly that takes your money when you buy a computer and invests it in selling Xboxes to your kid. Why should it be different with Web 2.0? But thanks for pointing us to the VC view.

semantic weltbild 2.0

Building the Semantic Web is easier together
