I think I have been lucky that several of the projects I have worked on have exposed me to managing large volumes of data. The largest dataset was probably at MailChannels, though Livedoor.com also had some sizeable data for their book store and department store. Most of the pain with Livedoor’s data came from it being in Japanese. Other than that, it was pretty static. This was similar to the data I worked with at the BBC. You would be surprised at how much data can be involved with a single episode of a TV show. With any in-house generated data, the update size and frequency are much less dramatic, even if the data is being regularly pumped in from third parties.
The real fun comes when the public (that’s you guys) are generating data to be pumped into the system. MailChannels works with email, which is human generated (lies! 95% is actually from spambots). Humans are unpredictable. They suddenly all get excited about the same thing at the same time; they are demanding, impatient, and smell funny. The latter will not be so much of a problem for you, but the other points will be if you intend to open your doors to user-generated information.
Need More Humans
Opening your doors will not necessarily bring you the large volumes of data you require. In the beginning it will be a bit draughty, and you might consider closing those doors. What you need is more of those human eyeballs looking through their glowing screens at your data collectors. Making that a reality is the hard part. If you do get to that point, then you will probably not have “getting experience with large datasets” at the forefront of your mind. Like sex, working with large datasets is most important to those not working with large datasets. So we will look elsewhere.
The Internet is really just one giant bucket of data soup. Sure, we all just see the blonde, brunette or redhead, but it’s actually just green squiggly codes raining down in uniform vertical lines behind the scenes. What you need to figure out is how to get those streams of data that are flying around in the tubes of the Internet to be directed through your tube, into your MongoDB NoSQL database, Solr search engine, Hadoop distributed file-system or Cassandra cluster.
“All I see now is blonde, brunette, redhead”
– Cypher, The Matrix
The trouble with most sources of data is that they are owned, and the data is copyrighted or proprietary. You can scrape websites and, if you fly under the radar, you will get a dataset. If you want a large dataset, though, it will take a lot of scraping. Instead, you should look for data that you can acquire more efficiently and, hopefully, legally. If nothing else, collecting the data in a legal manner will help you sleep better at night, and it gives you a chance of going on to use that data to build something useful.
Here’s a list of places that have data available, provided by my good friend, Geoff Webb.
Some others I think are worth looking at are Wikipedia, Freebase and DBpedia. Freebase pulls its data from Wikipedia on a regular basis, as well as from TVRage, Metacritic, Bloomberg and CorpWatch. DBpedia also pulls data from Wikipedia, as well as YAGO, Wordnet and other sources.
Make It Yourself
The quickest way I’ve found to get a good feed of data is to generate it. Create an algorithm that simulates mass user sign-up, 10,000 tweets per second, or vast amounts of logging data. Generating timestamps is easy enough. Use a dictionary of words for the content. Download War And Peace, Alice In Wonderland or any other book that is now out of copyright if you need real strings of words for your fake tweets.
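To make the idea concrete, here is a minimal sketch of a fake-tweet generator along those lines. It assumes you have saved a public-domain text (say, War And Peace from Project Gutenberg) as `book.txt`; the filename, the record fields and the fallback word list are my own illustrative choices, not anything from a real system.

```python
import random
import time

# Tiny fallback word list so the sketch runs even without book.txt.
FALLBACK_WORDS = (
    "all happy families are alike each unhappy family is "
    "unhappy in its own way"
).split()

def load_words(path="book.txt"):
    """Load a dictionary of words from an out-of-copyright book."""
    try:
        with open(path) as f:
            return f.read().split()
    except OSError:
        return FALLBACK_WORDS

def fake_tweet(words, max_len=140):
    """Glue random words together until the next one would not fit."""
    tweet = ""
    while True:
        word = random.choice(words)
        if len(tweet) + len(word) + 1 > max_len:
            return tweet
        tweet = (tweet + " " + word).strip()

def generate(n, words):
    """Yield n fake tweet records with timestamps and fake user ids."""
    for _ in range(n):
        yield {
            "ts": time.time(),
            "user": "user%05d" % random.randrange(10000),
            "text": fake_tweet(words),
        }

if __name__ == "__main__":
    words = load_words()
    for record in generate(5, words):
        print(record)
```

Crank `n` up (or wrap `generate` in an infinite loop) and point the output at whatever store you are trying to stress.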
I have built a load-testing system in which I recorded one day’s worth of data coming into the production system, then played it back at high speed on a loop into my load-testing system. I could simulate a hundred times the throughput or more. This could also be done with a smaller amount of data. If you can randomize the data a little so it’s slightly different each time around the loop, then even better.
Play In The Clouds
I would recommend you do not download the data to your home or work machine. Fire up an Amazon EC2 machine and store your data on an EBS volume. You’ll be able to turn the machine on and off so that you are only paying for it when you have time to play with it. The bandwidth and speed at which you can download that data will be much better. If you want to scale the data across a small cluster of machines, it will be easy to set up and tear down. The data will be in the cloud, so if you do get an idea of what to do with it, it will not take you two weeks to upload that data from your home machine to the cloud. It will already be there. On Amazon EC2 you can play with the same data on a cluster of small machines with limited RAM, or on an “extra large” machine with up to 64 GB of RAM. You can try all these different configurations very quickly and inexpensively. I mention Amazon EC2 because that is where my experience is, but the same applies to Rackspace and other cloud infrastructure providers.
There are data sources out there, but which one you choose depends on which technology you wish to get experience with. The experience should be with the technologies you are using, rather than with what the data is. Certain datasets pair better with certain technologies. Simulating the data is a second approach: you just need a clever way of generating and randomizing your fake data. Thirdly, you can use a hybrid approach: take real data and replay it on a loop, randomizing it as it goes through. Simulating the Twitter fire-hose should not be too hard, should it?
Please Leave A Comment
If you have other links to sources of good datasets, have any code for simulating datasets, or have any ideas on this topic, then please leave a comment on this blog post. I will answer any questions as best I can, and I intend to write much more on this topic in future blog posts. Your feedback will help guide the content of those posts.
Related Posts On This Blog