Where can I get large datasets open to the public? (quora.com)
215 points by helwr on April 5, 2011 | hide | past | favorite | 37 comments


Asking "What datasets are available to me?" is sometimes the wrong question. A better way of going about the problem is to ask something more specific, like "How can I create a heat-map of U.S. poverty?" The latter is better because it not only focuses your attention on something doable, it actually teaches you more about data analysis than just searching for datasets would.

For example, to solve the question above you are going to be asking yourself the following followup questions:

1) Where do I get a map of the U.S.?

2) How do I make a heat-map?

3) How do I feed my own data into this heat map?

4) What colors do I use?

5) Can I do this real-time? Do I need a database? What language do I use?

6) What's a FIPS code?

7) How do I find a poverty dataset with FIPS codes?

8) This poverty dataset doesn't have FIPS codes, but I can join it with this other dataset that does have FIPS codes.
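To make step 8 concrete, here is a minimal Python sketch of that kind of join. The county names, FIPS codes, and poverty rates below are made up for illustration; the point is just the lookup-then-attach pattern that gets you a table you can feed into a FIPS-keyed heat-map.

```python
import csv
import io

# Hypothetical lookup table that maps (county, state) to a FIPS code.
fips_csv = """county,state,fips
Autauga,AL,01001
Baldwin,AL,01003
"""

# Hypothetical poverty dataset that lacks FIPS codes.
poverty_csv = """county,state,poverty_rate
Autauga,AL,10.9
Baldwin,AL,9.8
"""

def join_on_fips(fips_text, poverty_text):
    # Build a (county, state) -> FIPS lookup from the first dataset.
    lookup = {
        (row["county"], row["state"]): row["fips"]
        for row in csv.DictReader(io.StringIO(fips_text))
    }
    # Attach a FIPS code to each poverty row that has a match.
    joined = []
    for row in csv.DictReader(io.StringIO(poverty_text)):
        key = (row["county"], row["state"])
        if key in lookup:
            row["fips"] = lookup[key]
            joined.append(row)
    return joined

rows = join_on_fips(fips_csv, poverty_csv)
```

In practice the messy part is normalizing the join key (county names differ in spelling and punctuation across sources), which is exactly the kind of thing you only learn by working a concrete problem.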


Open datasets are hard to come by. It's potentially easier to find problems to solve by looking at the available datasets than seeking datasets for the problems you wish to solve.


When you want to create a new website do you start by looking at a bunch of clip art and images? Browsing through datasets such as the ones listed in this thread leaves me overwhelmed. I've never had any difficulty finding open datasets. If there is a dataset I need that doesn't exist or costs money, I find a way to create it from scratch.


Datasets aren't always sought out for a purpose. Sometimes you want a specific dataset to solve a specific problem, and sometimes you'll take any dataset just to see what you can learn, or to make interesting infographics. Lists like this help in the latter situation.


It may sometimes be the right question; it all depends on what the enquirer wants to do with the data. This comment is just an exercise in answering a question that no one asked, and I fail to see why it merited any upvotes.


It is just an example of an alternate viewpoint. My point is that sometimes you will have more success finding data for questions that interest you than looking at datasets that interest you and trying to come up with questions.


I likewise feel that this type of list of data sources fits a data collecting/hoarding mentality, rather than a problem solving one. It doesn't help me think of interesting things to do with the data.


data.gov and other US gov data sites are getting severe cuts even though they're saving money (http://www.federalnewsradio.com/?nid=35&sid=2327798)

Very upsetting for fans of open / accessible (government) data.

FWIW, petition at http://sunlightfoundation.com/savethedata/


Google or Amazon should offer to sponsor it and make the data accessible in their respective cloud computing platforms. There's tons of potential for data analysis / consultancy companies to work on this data and it's too big to process anywhere else.


The major cost of the data is gathering it, not distributing it. But I agree this is something that needs to be archived.


I have a feeling this might be a bluff. This scare might just be politics as usual. It just doesn't make sense to cut this, especially if they're saving money.


Worth noting that nothing has been cut yet.


Hackers & Founders SV is hosting a hackathon[1] in two weeks at the Hacker Dojo in Mountain View. It's going to be geared towards working with Factual's open data API.

Factual's[2] goal is to provide an API to connect all those available data sets, and they have a fairly impressive list of data sets available. Factual is very interested in hearing what datasets you want to work with, and they are willing to bust ass to get them available before the hackathon.

We still have around 40 RSVP slots open. You can register here: http://factualhackathon.eventbrite.com/

</shameless plug>

[1] http://www.hackersandfounders.com/events/16535156/

[2] http://www.factual.com/

[3] http://factualhackathon.eventbrite.com/


How does Factual make their money?


It's free for developers, but if you want premium access, or if you're a large corporation, then they have a paid version.




http://www.hiv.lanl.gov/content/index

For sentimental value: HIV sequence data (and other data) from 1980 till now. Did my thesis on these ;-).

In general, there is an enormous amount of gene sequence data around, not just HIV.

http://www.ncbi.nlm.nih.gov/sites/

Whole genome sequences of eukaryotes (including humans): http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi


Is there any HIV sequence data indexed by patient? I mean, sequences of strains extracted from the same patient at different points in time?

I would email you directly about this, but you don't have any contact information :(





Edit: Whoops, I thought this was an "Ask HN." The below post still stands for anyone who finds it useful.

The U.S. Census has an extremely well-documented large data set:

http://www2.census.gov/census_2000/datasets/

And the documentation is here:

http://www.census.gov/prod/cen2000/doc/sf1.pdf

The software that they provide to go through the data is crappy, however ('90s era).

I have a Common Lisp program, equally crappy but more useful to a computer scientist, that will pull specific fields out of the data set based on a list of field names. If you want that, I can dig it up for you.

Also, before you start parsing this, it's worthwhile to read the documentation to find out how the files are laid out, and what each field really means. These files are not relational databases, so if you're looking at it through those lenses, confusion will result. In particular, some things are already aggregated within the data set.
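A rough Python equivalent of that kind of field-extraction script, assuming (as the SF1 documentation describes) that the data files are headerless delimited records whose column order you get from the docs. The field names and sample values here are illustrative, not copied from a real file:

```python
import csv
import io

# Assumption: the documentation gives you the ordered field list for
# each file; the files themselves carry no header row.
FIELD_ORDER = ["FILEID", "STUSAB", "LOGRECNO", "P0010001"]

# Made-up sample records in that layout.
sample = """SF1ST,AL,0000001,4447100
SF1ST,AL,0000002,43671
"""

def extract_fields(text, wanted):
    # Map each requested field name to its column index per the docs.
    idx = [FIELD_ORDER.index(name) for name in wanted]
    out = []
    for rec in csv.reader(io.StringIO(text)):
        out.append({name: rec[i] for name, i in zip(wanted, idx)})
    return out

rows = extract_fields(sample, ["LOGRECNO", "P0010001"])
```

The same caveat from the comment above applies: read the layout documentation first, because some fields are pre-aggregated and the files are not a relational database.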


How many of these allow me to create for-profit websites with them?


There's a startup called kaggle.com that is all about hosting data mining competitions around datasets, like the Netflix Prize.


http://aws.amazon.com/publicdatasets/ includes my former advisor's dataset (the UF sparse matrix collection), which in turn contains a matrix or two from my research.


At http://build.kiva.org there are some nice datasets in the "data snapshots" section. I have high hopes we will be releasing a lot more data.


I believe Steven Levitt used the Fatality Analysis Reporting System (FARS) from the National Highway Traffic Safety Administration (NHTSA) for his seatbelts vs. carseats work:

ftp://ftp.nhtsa.dot.gov/fars/


On that topic, anyone have any suggestions for the easiest way to prepopulate a directory of local businesses in the U.S.?


Yelp has an API that returns business data in a given geographic area. You could probably get a list of zipcodes from Wikipedia and then just loop through that.
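A sketch of that zipcode loop. The fetch is stubbed with made-up data here so the logic is runnable; in a real crawl you'd swap in actual API calls and handle pagination and rate limits. The main wrinkle the sketch captures is that nearby zipcodes return overlapping results, so you need to dedupe on the API's business ID:

```python
# Fake per-zipcode results standing in for real API responses.
FAKE_RESULTS = {
    "94103": [{"id": "joes-cafe", "name": "Joe's Cafe"},
              {"id": "acme-plumbing", "name": "Acme Plumbing"}],
    "94107": [{"id": "acme-plumbing", "name": "Acme Plumbing"},
              {"id": "book-nook", "name": "Book Nook"}],
}

def fetch_businesses(zipcode):
    # Stub: replace with a real HTTP call to the API of your choice.
    return FAKE_RESULTS.get(zipcode, [])

def crawl(zipcodes):
    # Queries for nearby zipcodes overlap, so dedupe on business ID.
    seen, directory = set(), []
    for z in zipcodes:
        for biz in fetch_businesses(z):
            if biz["id"] not in seen:
                seen.add(biz["id"])
                directory.append(biz)
    return directory

businesses = crawl(["94103", "94107"])
```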


That wouldn't come close to a full list of local businesses, only those that are customer-facing. Yelp has little coverage for B2B-focused businesses.


UK Government data sets: http://data.gov.uk/


We provide API access to more than 20 million articles (headlines, excerpts). People have done all sorts of interesting things with it - http://platform.newscred.com.


Infochimps?


You can find many large datasets here, http://beta.fcc.gov/data/download-fcc-datasets , some are over a gigabyte.


Freebase?


Microsoft Azure has some large datasets available.



