Working With Large Data Sets

For the past three and a half years I have been working for a start-up in downtown Vancouver. We have been developing a high-performance SMTP proxy that can scale to handle tens of thousands of connections per second on each machine, even on fairly standard hardware.

For the past 18 months I've moved from working on the SMTP proxy to working on our other systems, all of which make use of the data we collect from each connection. It's a fair amount of data: up to 2 KB for each connection. Our servers receive approximately 1,000 of these pieces of data per second, and the rate is fairly sustained because our customers are distributed globally. Compare that to Twitter's peak of 3,283 tweets per second (each a maximum of 140 characters), and you can see it's not a small amount of data we are dealing with here.
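A quick back-of-envelope check of those figures (taking the 2 KB per-connection figure as a worst case, since it is stated as an upper bound):

```python
# Rough throughput from the numbers above. 2 KB per record is the
# stated upper bound, so this is a ceiling, not a measurement.
records_per_second = 1000
bytes_per_record = 2 * 1024

bytes_per_second = records_per_second * bytes_per_record
gb_per_day = bytes_per_second * 86400 / 1024**3

print(f"{bytes_per_second / 1024**2:.1f} MB/s")  # ~2.0 MB/s
print(f"{gb_per_day:.0f} GB/day")                # ~165 GB/day
```

So even at the upper bound this is on the order of 2 MB/s sustained, which adds up to well over 100 GB of raw connection data per day.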

So what do we do with this data? We have a real-time global reputation network of IP addresses that send email. We use this to identify good and bad senders, the bad being spambots. We also pump this data into our custom-built distributed search engine, built using Apache Lucene, which is no small task when we aim to make search results live within 30 seconds of the connection taking place. This is close enough to real-time for our purposes.

I recently set out to scientifically prove the benefits of throttling, which is our technology for slowing down connections in order to detect spambots, who are kind enough to disconnect quite quickly when they see a slow connection. Given the nature of the data, I needed a long time range to show that an IP which later appeared on Spamhaus had previously been throttled and disconnected, and then to measure how long it took to appear on Spamhaus. I set up a job to pre-process a selected set of customers' data and arbitrarily decided 66 days would be a good amount to process, as this was two months plus a little breathing room. I knew from experience that it could take up to two months for a bad IP to be picked up by Spamhaus.

I will not go into the details of the results here, as they can be found on the MailChannels blog post entitled Comparing Spamhaus with Proactive Connection Throttling, but the cool thing about this was the amount of data that needed to be processed. I extracted 28,204,693 distinct IPs, some of which were seen over a million times in this data set. Here is the graph of the results I found. I thought the logarithmic graph looked perfect.

[Graph: Spamhaus detection results]

Graph taken from the MailChannels analysis of Spamhaus vs. throttling

Have you worked with large amounts of data that don't get the same fanfare from bloggers such as Mashable as the Twitter firehose does? I'd really like to hear your stories.


  1. SL

    Very cool stuff. I’ve always been hesitant to blog about anti-spam for fear of spammers picking up on it and modifying their techniques. Are you concerned that spammers might do that in this case?

    I'm just starting a tech company and hope to be dealing with large data sets. It's a balance between planning for the future and acting now. If there's no action now, there's no future, but without planning the future will be disastrous.

  2. Alan

    @Phil excellent post. I appreciate the problems you face when dealing with extremely high numbers in a short space of time. Your distributed search with Lucene sounds interesting. Would love to hear more about that in future.

    One method we use with extreme effectiveness is a data structure called a Bloom filter. It is an extremely effective tool for determining whether you have seen something before, without any expensive database or search lookup. A lot of big hash map implementations use it internally.
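    For readers unfamiliar with the structure the comment describes: a Bloom filter is a bit array plus k hash functions; it can return false positives but never false negatives, so a "not seen" answer is always trustworthy. A minimal sketch (the way the k bit positions are derived from one SHA-256 digest here is an illustrative choice, not any particular production implementation):

    ```python
    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: a bit array plus k hash positions.

        Lookups may report false positives but never false negatives,
        so "not present" answers need no database or index lookup.
        """

        def __init__(self, size_bits=1 << 20, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive k positions from one digest: 7 hashes x 4 bytes
            # fits inside SHA-256's 32-byte output.
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Example: remembering IPs we have already seen (addresses are
    # from the documentation range, purely illustrative).
    seen = BloomFilter()
    seen.add("203.0.113.7")
    print("203.0.113.7" in seen)   # True: it was added
    print("198.51.100.9" in seen)  # False: almost certainly never seen
    ```

    Sizing the bit array and the number of hashes against the expected item count controls the false-positive rate, which is the main tuning decision in practice.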

  3. Antti Siira

    If you want to see Bloom filters used in spam processing, you could check out gross[1], which uses them for greylisting. It has proven[2] itself a quite efficient way to block large amounts of spam.


  4. Michael Fever

    That's awesome, congratulations. It's really refreshing to see these kinds of innovations coming out of Canada, even if it is Vancouver and not Toronto lol =)

  5. Montreal Web Design

    You might want to look at S4 by Yahoo. They recently open-sourced it.