NoSQL Is for the Birds (gigaom.com)
70 points by thesethings on Nov 6, 2010 | 56 comments


I think the data model of some of the new databases is more suitable for certain use cases than the relational model.

Also it's not just a matter of scalability: latency is a very important constraint.

Finally I see a lot of small sites, forums, and blogs that can't scale easily even to small numbers using relational DBs, since the relational model and current implementations give you the illusion that you can design a few tables and then invent your queries as you go; in practice this does not work and is dramatically slow.

You don't need very large numbers to see how badly traditional DBs can suck. And the proof is the huge adoption of NoSQL in small companies: it's not just hype. People want to go home and sleep well, and will not throw away working SQL solutions just because something new is fashionable. The reality is that they are experiencing serious problems even at the scale a small or mid-size company operates.


Finally I see a lot of small sites, forums, and blogs that can't scale easily even to small numbers using relational DBs.

When MySQL was in its growth stage, about 50% of downloads were by users who were downloading a database for the first time in their lives! (I got that piece of info from early MySQL folks.) If a small site, forum, or blog can't scale easily on a relational database, it is because the developer doesn't know what they're doing. You can make an argument that relational solutions aren't user friendly because they have defaults not suited for small, high-performance sites, or too many features that cripple performance if not used correctly, but IMO that's a silly argument. People don't blame a programming language because their program is slow when they used an O(2^N) algorithm where O(N log N) suffices. Why do people blame a DBMS when they write nonsensical queries that couldn't possibly evaluate in real time, merely because it gave them the option to write such queries?

Also it's not just a matter of scalability: latency is a very important constraint.

Many relational databases have sophisticated schedulers that reorder events to maximize queries per second, unless an event has been in the pipe for too long, in which case it is given higher priority and pushed through faster. Most NoSQL products do not even begin to approach this level of sophistication. Even MySQL gives excellent latency characteristics if you know what you're doing - just ask the Facebook folks, they're very serious about latency. Good latency performance is certainly not restricted to NoSQL stores.

Most SQL vs. NoSQL debates usually have the same resolution: traditional DBMSes let you shoot yourself in the foot, while NoSQL solutions usually don't. With a traditional DBMS you better know what you're doing, while with a NoSQL solution you can usually get away with just jumping into it. Nine times out of ten, that's the value proposition. Most of the time it isn't performance, latency, a better/different data model, or anything of the sort.


I'd wager that people you see with "small sites, forums and blogs that can't scale easily to small numbers using relational DBs" are doing something wrong and they're probably going to have a long road in the NoSQL world as well. Just because you can make a query return the result set you want doesn't mean you've written the right query/schema design.

I assure you that there are plenty of performant sites running quite well on traditional SQL solutions that are FAR, FAR larger than those small blogs serving small traffic numbers that are having issues.


I'm using Redis as a backend to store course availability information at my university. If the set contains the course ID it has an open seat. Data is scraped and processed continuously and there are ~7200 courses so Redis is a good fit.
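
For concreteness, a minimal sketch of that setup, assuming redis-py; the key name and course IDs here are made up:

  import redis

  r = redis.Redis()

  def mark_course(course_id, has_open_seat):
      # The scraper calls this as it processes each of the ~7200 courses.
      if has_open_seat:
          r.sadd("open_courses", course_id)   # membership means "has an open seat"
      else:
          r.srem("open_courses", course_id)

  def has_open_seat(course_id):
      return r.sismember("open_courses", course_id)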


Is there a matrix of features or use case scenarios for the various NoSQL options out there?

e.g. I have a lot of person data with various attributes like applications, schedules, choices, demographics, courses, etc. (university). At certain times of the year, a lot of transactions occur, and at other times it gets very low traffic/transaction requests. MySQL tends to be a PITA when we try to modify things to add or remove tracking and so on during live hours, and unfortunately a lot of changes come in unplanned.

How should I go about finding the best fit for this scenario?


Hmm, Twitter is really big on Scala.

The important message is this: "focus on the process to build a great company and great products... They didn’t become successful because they built these systems. They built these systems because they became successful."

The next unproven startup that uses a NoSQL solution is just wasting its time. I know this because it happened at one of my previous workplaces (we used HBase).


At the last startup where I worked, we built our own document store (in Scheme!) before we had any users, with exactly the same reasoning that causes other startups to over-engineer for scale.

Premature optimization is a kind of over-engineering. Thankfully, it became so frequently hammered into new coders' heads that premature optimization is a Bad Thing that it's no longer much of a worry. It seems to have been replaced with premature scaling in the past few years, though. Anyone that could solve the problem of rampant over-engineering could potentially wield the awesome power whispered about in legends.


The problem is that over-engineering often has its roots in the very same drives that brought many of us into technology in the first place: the opportunity to work on something "really cool." As a developer turned manager, I've seen both sides of the coin, and it is a delicate balancing act. You don't want to completely stifle innovation (and morale) by being hard-headed about over-engineering. Conversely, you want to direct that energy appropriately so that you don't create a culture of "under-engineering" and unhappy developers.


I totally agree. I may be totally wrong on this point, but it seems to me that over-engineering (in the forms of feature-creep, premature optimization, premature scaling, over-specification, planning for too many unlikely use cases, and all its other myriad forms) is often a tunnel-vision problem, a feature of the blessing/curse of engineers' tendency to focus really hard on a single thing.

If you look at the single-page source code of Plan 9's implementation of the cat command, and then compare it to GNU's, you can almost see it directly. The Plan 9 coders were looking more broadly at the OS and knew they'd need a cat command, while the GNU coders were thinking just about the Unix userspace; so when the Plan 9 coders had cat, they continued hacking by moving on to the next thing, but when the GNU coders had it, they continued hacking by stuffing more things into cat.

Hackers love hacking, and this is great for the hackers and the users, but one rarely hears "This program is sufficient. I will consider it complete and write a different one until it becomes clear that the program is insufficient." (I don't exempt myself here.) Or maybe it's just too apparent when the designers forget it, so it stands out.

It's even more apparent on the web, where the default mode of a site is to see itself as a site rather than a component of the web at large.


At the same time though, you're saying you failed solely because you used HBase? Sure, Twitter's architecture evolved out of necessity, but starting out with a NoSQL solution doesn't automatically set you up for failure.


That does assume that the company's non-SQL choices don't actually save time during development.

When building a system whose design is in constant flux, a lot of time can be lost dealing with schema rigidity.


That's one of the tough myths to overcome when building on top of a relational database. Changing the Database has traditionally been Hard, so everybody has been trained to think that way, therefore nobody is allowed to change the database because Changing the Database is hard. Try it at your bigco and the old guys will do everything in their power to stop you (thus making changing the database hard to do).

Ignore that rule and build yourself an environment where changing the database is easy. I make schema changes to my stuff all the time, and seldom push a release live that doesn't do so. The tools are in place to ensure that it's No Big Deal, so it just works.

If you live in a world where changing your SQL database is easy, it sort of takes the wind out of the "start with NoSQL, because changing the database is easy" argument. You get all the speed advantages of being schema-flexible, and you can write ad-hoc queries when you want to, so you're flexible in that direction too.


Let me clarify - I'm a startup hacker, not a bigco guy at all :)

And I still use MySQL all the time, right alongside so-called NoSQL solutions where they are a better fit for a given purpose: Membase for high-availability collections on the order of billions of records in a social game; MySQL for defining the game world itself; MongoDB for any and all data for which eventual consistency is good enough (e.g. analytics). I've streamlined my MySQL dealings in precisely the ways you outlined. I have change scripts for every schema change, and I have declarative, YAML-driven schema auto-generation in Symfony. Schema changes are pushed out through staging to production - lazily when possible, and actively when necessary.

But despite all the process improvements in the world, the dev time savings that go along with a smart, document-oriented model layer are not a myth. I assure you, they are very real. No longer does every new feature have to start with schema design (no matter how streamlined your schema alteration process may be, it has a nonzero cost, and one which definitely increases with scale). You can instead just get right to the code, and start setting and getting the properties your new feature will need.
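
To illustrate, a rough sketch of "just setting and getting the properties" against a document store; pymongo is assumed here, and the database, collection, and field names are invented:

  from pymongo import MongoClient

  players = MongoClient().game.players  # hypothetical "game" db, "players" collection

  # New feature: track a player's favorite color. No schema design step and no
  # change script; just set the property on the documents that need it.
  players.update_one({"_id": 42}, {"$set": {"favorite_color": "green"}}, upsert=True)

  doc = players.find_one({"_id": 42})
  print(doc.get("favorite_color"))  # documents that never got the field just return None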

The fact remains: SQL is not a one-size-fits-all solution any more than the recent "massively scalable" data stores are. A modern backend engineer should know a lot about a variety of datastore solutions, and should think long and hard about which data should be stored in which manner(s).


Would you mind sharing some of the tools and techniques that make this easy for you?


All you really need to pull it off are:

- Change scripts for every schema change, stored in source control.

- An automated build/deploy that pulls down new change scripts and executes them in order.

- (optional) a good way to generate your backend CRUD by looking at the existing database schema, or as a lesser option an ORM that does the same thing.

So your workflow is: script out your schema change, check it in, apply it to your dev environment, regenerate the CRUD from your local schema, fix any compile errors that you've introduced. (and optionally make sure all your unit tests still pass).

You'll notice that all that stuff above is just the basic workflow you should have in place anyway.
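
A minimal sketch of such a change-script runner, assuming numbered migrations/NNN_*.sql files and a schema_migrations bookkeeping table; sqlite3 stands in for the real database driver:

  import os
  import sqlite3  # stand-in for your real database driver

  def migrate(db_path="app.db", migrations_dir="migrations"):
      conn = sqlite3.connect(db_path)
      conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
      applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
      for name in sorted(os.listdir(migrations_dir)):  # e.g. 001_add_users.sql, 002_...
          if name.endswith(".sql") and name not in applied:
              with open(os.path.join(migrations_dir, name)) as f:
                  conn.executescript(f.read())         # apply the change script once
              conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
              conn.commit()
      conn.close()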


This is exactly the pattern used by Rails apps with database migrations and Capistrano deployment.


What is "schema rigidity"? I'm genuinely curious, do the NoSQL types really believe that you can't add a column to a table on the fly in a RDBMS?


The problem is not whether you can, but how long it takes. At least with MySQL, an ALTER TABLE on a large table will take a long time (hours, if not more), during which the table cannot be written to.


That's not representative of RDBMSs in general. In Oracle we don't think twice about adding a column to a table during production hours. The only issue is if the new column has a default value and you have millions of rows that you need to "backfill" but that's just a big transaction; it's nothing remarkable in and of itself, if you could do a transaction that big anyway, you'd just go ahead and do it. And of course, in Oracle readers don't block writers and writers don't block readers, we have MVCC.

Once again, NoSQL is shown to be a reaction against MySQL, not RDBMSs in general.


That was but one example out of many. Oracle is a great database. It isn't great at everything, however. Geographically distributed, fault tolerant, scale out architectures as often required for big online services? Not a great fit. Multi-petabyte complex analytics processing? Not a great fit.

There is a set of relational database folks fixated on the false dichotomy of relational databases OR non-relational databases. In practice, they are often combined in a variety of ways. Insisting on One True Database is like insisting on One True Operating System or One True Programming Language. Stop obsessing over tools and build useful stuff!


Oh indeed, right tool for the job, I'm 100% with you there.

The issue is, from the NoSQL camp we hear about "schema rigidity". We hear "SQL doesn't scale". These things simply aren't true! It's as if someone had only ever written .BAT files on DOS and thought that all its limitations applied to Python as well (and went and told experienced professional Python devs that!).

There have been "object repositories" such as Versant for a long time. The NoSQL types seem oblivious to these too.


Schema rigidity is a canard originating with the same folks who think a hash is a type system. I encourage you to dismiss the folks who say things like that, rather than dismissing some very useful technologies.


Yes, but Oracle is expensive, especially since it requires people who really know Oracle if you want to be up to speed relatively quickly. So as always, it is a tradeoff: it seems that in some cases, not having the usual RDBMS guarantees is OK because there are lower admin costs, etc.

So sure, some people don't understand those tradeoffs and make stupid choices. But people who choose technology without properly assessing the risks/advantages are bound to fail anyway.


If your major expense is people who know what they're doing, then Oracle is not even the most expensive platform... During any web boom I bet LAMP guys were billing higher hourly rates than people doing Oracle!


Other databases do not have this problem; MySQL is just a particularly bad example of an RDBMS.


Google and Facebook have all the money and talent required to deploy epic Oracle systems. Instead they use GFS and BigTable and Cassandra and HBase and Scribe and Hadoop and MySQL and a host of other systems. Amazon has massive Oracle deployments, so plenty of money and knowledge on the topic, but still built Dynamo and S3.

There are billions of dollars riding on this for them. Instead of insisting they are wrong, ask yourself why they might be right.


Google and Facebook aren't good examples, because they have no hard transactional requirements for their main applications. If a web page isn't included in one search result but is in the same search executed on a different node 5 minutes later, who would notice or care? If a status update gets dropped, it might be annoying, but you can always just resend it.

If you want to compare like for like, ask why Visa isn't using MongoDB for authorizations, or why American Airlines isn't using Redis for reservations.


So you are aware, yours is the traditional response. Somehow, doing what Google and Facebook do is "easy" because not everything requires transactions. This is false both because scale like that makes almost everything difficult, and because, as Google recently published, they are using transactions for their main application. NoSQL does not imply lack of transactions and transactions do not imply relational databases.

http://research.google.com/pubs/pub36726.html

"Databases do not meet the storage or throughput requirements of these tasks"

And Mongo or Redis? Come on, scro.


The post you are replying to does not contain the word "easy" at all, so far as I can discern. I think you might be attacking a straw man.


I doubt that any DB that stores data physically in a row-oriented format avoids issues with this particular type of change. But of course there are many other ways to work around that. For instance, you could just create a new table and a view to join the two. Or, if a particular table changes all the time, redesign it to store key-value pairs instead. An RDBMS can easily be used in a schemaless fashion if needed for particular scenarios.
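
A rough sketch of both workarounds, with sqlite3 used purely for illustration; the table and column names are invented:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);

      -- workaround 1: new attributes live in a companion table, joined via a view
      CREATE TABLE users_extra (user_id INTEGER PRIMARY KEY, nickname TEXT);
      CREATE VIEW users_v AS
          SELECT u.id, u.name, e.nickname
          FROM users u LEFT JOIN users_extra e ON e.user_id = u.id;

      -- workaround 2: key/value rows for attributes that change all the time
      CREATE TABLE user_attrs (
          user_id INTEGER,
          attr    TEXT,
          value   TEXT,
          PRIMARY KEY (user_id, attr)
      );
  """)
  conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
  conn.execute("INSERT INTO user_attrs VALUES (1, 'theme', 'dark')")
  print(conn.execute("SELECT * FROM users_v").fetchall())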


Whether it saves time or not depends on a lot of things. If users actually determine the structure of your data then you're right. If it's just about schema evolution by developers I don't think that moving schema constraints into procedural code makes things simpler or more flexible.


An RDBMS does require a bit of planning (not too much, and not too rigid) and previous experience building good data models; both traits are often missing in hotshot hackers/startups these days, because they have neither the patience nor the deeper experience.


I seriously doubt that this is true. SQL databases are extremely easy to adjust. Hell, you can add columns to your live database in the middle of the day if you want. Removing data is harder, but normally you don't have to delete columns just to launch your new code.

Another thing to remember is that most SQL systems are mature, and there are pretty much always very good tools available to do any kind of change you may need to make.


The existence of great tools for relational databases is a compelling argument for using them. As I said in the article, starting off with a single, monolithic relational store is a successful approach employed by many successful companies. I would suggest, though, that the rest of your comment indicates a lack of experience with relational databases at large scale. One metric provided by Twitter in a presentation was that an ALTER TABLE command took two weeks to run on a previously centralized relational database. Perhaps someone from Twitter can add some color to that anecdote.

tl;dr - Scale breaks your assumptions.


Scale just breaks MySQL.


If the answer is "Spend millions on Oracle", people will be motivated to ask a different question.


Like I say, in business there is no cheap or expensive. There's worth the money, or not.

Yes, Oracle costs money. But so does "rolling your own". How much of their VCs' cash has, say, Twitter spent doing that? Yes, Oracle "locks you in". But your own legacy code locks you into whatever platform you've built it on.


In the specific case of FlockDB, that's actually not the case as the shards are modular. SQLShard is one implementation. There was an experimental Redis shard implementation, as well. You might be both overestimating the cost of building things like this and underestimating the enormous drag of using closed source components in these systems. If you've personally built online services this big with Oracle, well done. What did it cost?


Once you have hash joins, shards start to look awfully restrictive compared to partitions... MySQL (et al, I don't know about FlockDB) can freely shard because they're not losing functionality they don't have in the first place.

Maybe I'm not a good example because I've mainly worked in financial services, but nearly every Oracle project I've worked on has been wildly profitable, and most have been at the level (in terms of transactional throughput, and in terms of features that were cheaper to buy than build) at which there are only really two choices, Oracle or DB2.


Vendors highly optimize for your case, and you find that their solutions work for you. Not a surprise. Different constraints and different requirements produce different solutions. Nobody is saying you are doing financial services storage and processing wrong. Why are you so insistent experts in a totally different field are doing their jobs wrong?


I'm not. I'm insistent that the assertions that RDBMSs don't scale and that RDBMSs are too inflexible for rapid, iterative development are false.


Absent a metric for "scale" it isn't false, it's meaningless. Here is some context to help you distinguish how things work for online services as opposed to financials:

As you are not printing money, you care about cost efficiency. This means you are biased towards using white box hardware, and having as little variation as possible. In the financial world, they use name brand hardware and whatever configurations make sense for a specific application, which is good, because Oracle and IBM are not interested in supporting their products on white boxes. You're Twitter, you have 15 billion edges in this database. Each one is conservatively 24 bytes. 360GB of raw data. Everything you do depends on that data being always available, extremely tolerant of component failures, accessible with very low latency, so assume it all has to be in RAM on a bunch of machines. The Oracle answer is a cluster of fat, named-brand servers (totally unlike all the rest of your hardware) with a dedicated interconnect, a lot of license fees, support contracts, and dedicated DBAs (plural, since you'll need them on-call). Special hardware, headcount, license and support costs. Hit a bug? Call Oracle. Need a feature? Call Oracle. Wish that precious headcount was filled by engineers writing code instead of DBAs carrying pagers? Too bad.

When scaling includes having to fit into the same model as all the rest of your infrastructure, and that model is not the Oracle model, then it becomes a bit clearer why engineers might dismiss Oracle. They are left with those inferior options like MySQL, which you've already agreed has various scaling issues (though Facebook somehow manages). So, they invest a few months of a few engineers and they have a purpose built system that does what they need, fits the rest of their model, for which they have the source, and for which the maintenance costs are likely to be far lower.

Like I said in the article: absent a specific problem for the business, SQL vs NoSQL is just noise. Something like FlockDB is far simpler and cheaper than throwing Oracle at the problem. That's not a technical argument, it's a business argument. If you want to argue that the big players don't know how to run their businesses, I will not try to stop you.


You can, right now, go to Dell's website and buy, off the shelf, a server with 144G of RAM. Stick Red Hat on it and Oracle considers it fully supported. The days of "vanity" brands that you'd stick behind a glass partition and take your investors on a tour of the datacentre to see are loooong gone.

And the reason you need devs and DBAs to be separate isn't one of different skills at all, both speak PL/SQL fluently. It's just if the regulator of your industry requires that the code be developed and deployed by different people. If not, it's normal for the two camps to have significant overlap.


Yes, you can go buy name-brand hardware (Dell, IBM, Sun, HP) in a special configuration not used anywhere else in your infrastructure and install an OS on it which you don't run on any other system, all just so you can have the honor of paying Oracle. That was my point.


Eh? Dell is very much not a brand like IBM or HP, they are pioneers in the field of cheap generic hardware. That's why I chose them as an example! And Red Hat is hardly an obscure OS these days either... And Oracle runs happily on Windows or many other common OSs too...


You don't have to spend millions on Oracle. You can spend thousands on SQL Server or get PostgreSQL for nothing too.


What scale? How many concurrent users of your specific site break MySQL, and how many concurrent users do you have? Not answering or considering those questions is what leads to premature scaling.


Since I cannot edit my original comment (no idea why), I'll put my update here:

Looks like the people who replied to my post got into a deep technical discussion without mentioning a startup at the scale of Twitter. Exactly what the author pointed out: no successful product yet.

I hope people can see a pattern here:

  - Twitter: FlockDB and a variety of supporting infrastructure
  - Facebook: Cassandra and in-house tools
  - Google: GFS, BigTable, and in-house tools
  - LinkedIn: Voldemort and a variety of supporting tools

Each successful startup built a NoSQL solution based on its own needs.

Better make sure your needs align with theirs, sir.


Again this topic brings out the "Always use relational databases" folks.

Few say "Always use NoSQL stores".

Most of us who use NoSQL, use it in conjunction with other stores, some of which are relational databases and some aren't.

Lunch isn't free, blah blah, etc. However, with NoSQL solutions we can optimize the things we care about (reads and writes) over the things that happen rarely (changes to the schema, indexing, etc.) if we want to. Sometimes our apps demand it, sometimes it's because the cash doesn't yet exist to even try Oracle, or to hire the people you need to support it properly.

I don't get why people steadfastly assume everything fits the paradigm they're most comfy with. It doesn't. People who are just as comfy with your paradigm found stuff that works better, easier for them.


Again this topic brings out the "Always use relational databases" folks.

You show that strawman who's boss!

I don't get why people steadfastly assume everything fits the paradigm they're most comfy with.

Indeed.


Gaius at least is going all rdbms all the time all over the thread. I'm not dreaming that up am I? (I didn't realize it was one guy at first).


Great article. But it only sheds light on the performance and scale aspect. Having used NoSQL in my last project, I found some other aspects matter too:

  - Flexibility
  - Replication
  - License agreements
NoSQL allows easy and painless schema changes, because there is no hard-coded schema. Using a morphing library, every object can be stored without any additional coding.

Some NoSQL DBs support two-way replication for nodes that can be offline for a long time (for instance mobile clients). This can be very handy if you need such a feature.

Some NoSQL DBs come with a very liberal Apache license.

I find it great to have so many choices today.


Relational databases have as much or as little schema as you like and the article did point that out. NoSQL varies widely in terms of approaches to schema so I don't think a blanket statement about NoSQL schema evolution is even possible.

The issue I see is that moving the kinds of guarantees that schemas provide into application code doesn't necessarily make the system as a whole more flexible or easier to modify. The exact opposite could be the case.

Postgres has a very liberal license as well by the way.


Indeed. In a relational database you could just do

  create table nosql (id number primary key, stuff blob);
Hey presto, generic key-value store. The point of an RDBMS is that you can do this and lots more too.
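
A quick sketch of treating that table as a key-value store; sqlite3 is used here for illustration, so the column types differ slightly from the snippet above:

  import json, sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE nosql (id INTEGER PRIMARY KEY, stuff BLOB)")

  def put(key, obj):
      conn.execute("INSERT OR REPLACE INTO nosql (id, stuff) VALUES (?, ?)",
                   (key, json.dumps(obj)))

  def get(key):
      row = conn.execute("SELECT stuff FROM nosql WHERE id = ?", (key,)).fetchone()
      return json.loads(row[0]) if row else None

  put(1, {"user": "alice", "score": 10})
  print(get(1))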


NoSQL and RDBMS are not competing tools occupying the same solution space. It's not an either-or, and anyone who thinks that it is has no idea what they're talking about.

NoSQL doesn't give you a tool that scales better; it gives you a different tool that might fit what you're trying to do better.


Some NoSQL products are geared towards rapid prototyping rather than scalability, for example CouchDB. With CouchApps you can create an entire web app in JavaScript, sitting inside CouchDB.



