
This article is the equivalent of "horse drawn carriages are perfectly adequate for most journeys, and much more pleasant and commodious to boot." Good luck with that, buddy.

You're not going to know what correlations are important and which are not until you study the data. Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.

It's also more than a little insulting to FB and Yahoo to insist they are not web scale. The problem of small jobs on MR clusters is real, but even with small jobs, Hadoop turns out to be a lot more cost-effective than various other proprietary solutions which are your only real enterprise alternative. The problem of small MR jobs is being solved by things like Cloudera Impala, which can run on top of raw HDFS to perform interactive queries.



The point was that not everyone needs or has big data. That's hardly controversial. Even in some instances where you think you have big data that needs to be handled in parallel by a cluster, the work could easily be handled by a single server or even a laptop. Again, nothing controversial.

The most important thing is knowing what data you have, how best to collect it, and what it can (and can't) tell you. Just because you find correlations doesn't mean that they are real. It takes people with real expertise to help here, and just running your data on a cluster isn't going to help you. In fact, it could even hurt.

I didn't see anything wrong with the article at all.


Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.

He doesn't tell anyone to collect the "important data," and he doesn't insist FB or Yahoo are not web scale.

His concluding paragraph is relatively weak, but the main thesis -- most businesses can ignore the Forbes/BI crap and analyze their data sufficiently using normal tools -- is true and sound.


This is such a naive response. It is the sort of response that purports to show how naive its object of criticism is, yet it misses the point entirely. First of all, you misunderstand the author: he is claiming that people treat the terms "big data" and "analysis" as synonymous, and that this is erroneous.


The problem is that your ability to explore the data and the data volume are inversely correlated. You are far more likely to find interesting things exploring an in-memory dataset using something like IPython and pandas than throwing Pig jobs at a few dozen TB of gunk. Big data is great if you know exactly what you are looking for. If you get to the stage where you are exploring a huge DB looking for relationships, you need to be very good at machine learning and statistical analysis (spurious correlations ahoy!) to come out significantly ahead. It's also an enormous time sink. In sum: the bigger the data, the simpler the analysis you can throw at it efficiently.


Very true. Wouldn't the typical approach to this involve probabilistic methods, like taking large-ish (but not "Big") samples from your multi-TB data and doing your EDA with those?
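A minimal sketch of that idea, with synthetic data standing in for the multi-TB dataset (the data, sizes, and 1% sampling rate are all illustrative):

```python
import random
import statistics

# Synthetic stand-in for a dataset far too large to explore in full.
random.seed(0)
full = [random.gauss(0, 1) for _ in range(200_000)]

# A uniform 1% random sample is usually enough for exploratory statistics:
# when rows are independent, the sample mean tracks the population mean.
sample = random.sample(full, k=len(full) // 100)

print(abs(statistics.mean(full) - statistics.mean(sample)))
```

In practice you'd draw the sample once from the big store and then iterate interactively on the small copy, which is where the caveats below come in.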


That would work very well if our random sample accurately reflected the superset of the data, which it almost always does, but you also want to consider the following...

Imagine our data was 98% junk, with the remaining 2% consisting of sequential patterns. We may be able to spot this on a graph relatively easily over the whole dataset, but random sampling would greatly degrade that signal.

We can extend that to any ordering or periodicity in the data. If the data at position n has a hidden dependency on the data at position n±1, random sampling will break us.


Do random sampling plus n lines of surrounding context.
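A quick sketch of that suggestion (the function name and parameters are made up for illustration): sample anchor positions uniformly, then keep n rows of context on each side, so dependencies between a row and its neighbours survive the sampling.

```python
import random

def sample_with_context(rows, k, n, seed=0):
    """Pick k random anchor rows and return a window of +/- n rows around each."""
    random.seed(seed)
    anchors = random.sample(range(len(rows)), k)
    windows = []
    for a in sorted(anchors):
        lo, hi = max(0, a - n), min(len(rows), a + n + 1)
        windows.append(rows[lo:hi])  # contiguous slice: local order preserved
    return windows

# Toy data where position matters: each window comes back as consecutive rows.
data = list(range(100))
windows = sample_with_context(data, k=3, n=2)
print(windows)
```

Windows can overlap for nearby anchors, which is usually harmless for EDA; deduplicate them if the double-counting matters for your statistics.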



