Why NoSQL (rxdb.info)
90 points by typingmonkey on Oct 6, 2021 | 120 comments


NoSQL is super easy to start with. Then you have a heaping pile of unstructured data that's hard to query, hard to report on, and hard to gain insight from.

Every data task becomes a programming problem that needs a team to work on it, producing something that might already be stale. With SQL there's a lot of up-front work. But you have a DB that everyone knows. 3rd party tools work. Developer on-boarding is easy. When your boss asks how many users used feature X in the past month, you can actually tell him right away...

Performance is actually good and consistent. Your data is also consistent and respects ACID principles. The whole write-performance story of NoSQL DBs is a bit of a crock that doesn't stand the test of scaling. With proper caching, SQL is just as performant when done right, and the profiling tools are better. Screw NoSQL. They are just Object Oriented DBs (which were a disaster in the 90s) in new clothes.


Actually they're even older.

When I talked to my dad about RDBMSs he was like "weeeelll, sure, there's things like DB2 UDB that do relational, but performance-wise, nothing beats reading the data straight by key in exactly the format you need."

DB2 UDB: "Initial release: 1987; 34 years ago". I.e. what he would rather use and is sort of the NoSQL equivalent is _even older than that_.

Background: he started off with 360 assembler and worked all his working life on IBM Mainframes and the various technologies in and around it. They had it all and they had it before it came to "us". We're just re-inventing most of these things on much cheaper and more open hardware and software.


What I fail to fathom today is: when I think back to the late 90s, I had various production RDBMSs - SQL Server 6 (or 7, can't remember) - running on something like a Pentium 4, with millions of rows of data in various tables... and the performance was... fine. Not an issue.


Today you need a Cassandra cluster fronted by a Kubernetes cluster to get similar performance. The good news is that you don't need a couple boxes under your desk, you can pay AWS tens of thousands a month for it.


Single machine Postgres is mostly fine when you're working with less than hundreds of millions of rows, less than a few hundred gigs of data, don't need HA, and don't need more than a few thousand QPS. Outside of those constraints is why new things exist, imo.


Of course, the overwhelming majority of all businesses (including web-centric businesses) work fine within those constraints.


Sure, small businesses are fine using Postgres. I would argue that by developer market share, though, most (or perhaps just many) developers have to worry about these issues, since the companies that have scale have to hire many more developers.

There's also the argument that businesses hoping to scale might be better off assuming they will need it to save themselves work down the road.


Computers are very fast and fast storage is very cheap these days.

Your Data Isn't That Big https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html


> Your Data Isn't That Big

This is an extremely important lesson. You don't have Big Data, unless you're one of a handful of huge companies.

Years ago I joined a mid-size public company and inherited one particular service. Management was adamant that it had to migrate away from MySQL onto Cassandra due to data volume. But there were no metrics being tracked, it was driven by "we have to do it because everyone is doing it".

I started tracking actual usage, growth and performance and it didn't seem to point to a problem that needed solving. So I halted that project (to their credit, they listened to engineering input so I was able to).

A few years later when I left, it was still perfectly fine on MySQL. Nowhere close to running out of capacity. Today it's been almost ten years and while I'm not there, what I hear is that it's still fine.


"But my data is more than 5TB! Your life now sucks - you are stuck with Hadoop. You don't have many other choices (big servers with many hard drives might still be in play), and most of your other choices are considerably more expensive."

My point is that large companies that do need to scale hire many developers. FAANG alone likely employs north of several hundred thousand engineers, and according to this article[1] there are only something like 1.5 million software engineers in the US.

[1]: https://www.daxx.com/blog/development-trends/number-software...


Even combined, FAANG is probably only around 150k programmers. The vast majority of the world's 24-26M working software programmers do not work at them, or even at the top 100 such companies.


PostgreSQL handles 100k QPS without any issues unless your queries are very complex and I know many people who run multi-TB databases in PostgreSQL without any issues either. But it all depends on your workload of course.


Yeah, I probably could have bumped both those numbers an order of magnitude.


Well, millions of rows of data is tiny these days: if you have 100,000 IoT devices, each generating on average 10 telemetry points a second, that's already 1M writes per second. Store that telemetry for a week and you already have 600 billion rows.

Of course that's an oversimplification, but the data generated today is mind-bogglingly large compared to the 90s.


If you have 1M writes per second, you have a data collection problem, no matter the technology. Doing it on a single node is not a great idea just because it'd be a huge SPOF.

In most cases, you have colossally fewer than 1M writes per second, and you usually distribute them over several boxes. The IoT avalanche of writes in particular is naturally sharded; that is, all these writes can occur independently, and then you can replicate statistics into an analytical RDBMS.

A large write load that needs to be transactionally ordered, like a stream of purchases in a large online store, is what can be a hard(er) problem for an RDBMS.


Yup, I agree with you. Most businesses can work fine with a traditional RDBMS, even at TB scale, if you have a large box with lots of RAM and SSDs.

My example was just one use-case where one should consider using a more unconventional DB like Cassandra or HBase.


> but the data generated today is mind-boggling large compared to the 90s.

For what percentage of companies would you say your statement is true?


:s/these days/at your particular job/g


It is not an issue today either. Modern RDBMSs have become better optimized and hardware has become faster, so a lot of NoSQL is premature optimization, where people optimize for having 100+ TB databases when their database is actually 100 GB.


The original Unix DBM file was basically NoSQL in a SQLite-type package, and it was developed in 1979.

https://en.wikipedia.org/wiki/DBM_(computing)
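

For flavor, Python still ships a direct descendant of that interface in its standard library. A minimal sketch of the key-value style DBM pioneered (file name invented):

    import dbm

    # Open (or create) a DBM file; "c" = create if missing.
    with dbm.open("users.db", "c") as db:
        # Keys and values are byte strings: you read and write straight
        # by key, in exactly the format you need - no query language.
        db[b"user:42"] = b'{"name": "Ada", "plan": "pro"}'
        print(db[b"user:42"])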


I worked on DB2 UDB in the 90s, and learnt a lot about DBs through all the improvements in isolation levels. Those were the days when you'd actually have to make some judgement calls about setting specific isolation levels to improve performance. Nowadays you just chuck an SSD and several TBs of memory into the works (or talk to someone about some massively horizontally scaled solution, and distract from the real problem at hand).


Oracle introduced its RDBMS in 1979, based on Codd's paper from 1969. Mainframes were all flat-file, not object-relational, though you could finagle them into being sort of that. Object-relational didn't come into play until the 90s.

The main issue with object-relational as the "main" database is that it's hard to run reports on. Reports (for biz) are half the reason the system exists. I'm not saying ORDBMSs don't have a place, but an RDBMS is much better suited for real-time data analysis / reports. I think ORDBMSs are popular because they're easy for non-database people (who don't know third normal form or SQL) to store stuff in.


Let's not erase the work of Stonebraker et al whose efforts through the 70s were a lot closer to Codd than Oracle's. Ingres's query language, Quel, springs to mind: being directly based on the relational calculus it was far more readily composable than SQL (which might be charitably described as "borrowing ideas from the relational model") was or is.


You may be overlooking System R from 1974:

https://en.wikipedia.org/wiki/IBM_System_R


Well, even plain old filesystems are some weird hybrid between key-value stores and graph databases, too.


It's sometimes very helpful to think of filesystems as a streamable blob store with a hierarchical key.


That's one valid perspective. But don't forget that you can also seek and mmap etc. And hard-links are a thing, too.

Things like FUSE also show that it's easy to view other databases as filesystems, too. As a silly example, you can 'explode' /etc/passwd into individual files per entry. Or you could expose even a relational database as a filesystem.

For the latter, you could do something simple and expose tables as directories and rows as files. Or you could do something dynamic, and e.g. represent (read-only or idempotent) queries as file names, and their results as file contents.

That wouldn't be so much a traditional filesystem, but would be accessible via a filesystem interface.
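

To make that concrete, here's a toy sketch of the "tables as directories, rows as files" idea, materializing plain files instead of serving them through a real FUSE mount (schema and names invented):

    import json
    import os
    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('ada'), ('grace')")

    # Each table becomes a directory, each row a JSON file named by its id.
    for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
        os.makedirs(table, exist_ok=True)
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        for row in cur:
            rec = dict(zip(cols, row))
            with open(os.path.join(table, f"{rec['id']}.json"), "w") as f:
                json.dump(rec, f)

A real FUSE implementation would serve these paths dynamically instead of dumping them, but the mapping is the same.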


I hear AWS DynamoDB is basically just a "key value store interface" to S3 file storage.


This is not the right way to approach this. NoSQL and SQL both have use cases they are suited for, and saying one is better than the other is wrong. The faster writes of NoSQL do help with scaling - but if your app needs transactions then obviously NoSQL won't work.

For reporting, unstructured data is actually not bad. If your report requires you to scan your whole database, then SQL won't necessarily be faster; you have to think about disk performance as well as how complex the joins you need are.


SQL databases can do several levels of transaction “strictness”, so their write speed can be just as fast.
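

For instance, a sketch with psycopg2 against a hypothetical Postgres database (DSN and table name invented) - the same dial exists in most RDBMSs:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN

    # Relaxed isolation: cheaper writes, tolerates non-repeatable reads.
    conn.set_session(isolation_level="READ COMMITTED")
    with conn, conn.cursor() as cur:
        cur.execute("UPDATE counters SET n = n + 1 WHERE id = %s", (1,))

    # Strict isolation: full serializability, at the cost of aborts/retries.
    conn.set_session(isolation_level="SERIALIZABLE")
    with conn, conn.cursor() as cur:
        cur.execute("UPDATE counters SET n = n + 1 WHERE id = %s", (1,))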


I use Rails and Mongo at work. There is a schema, enforced declaratively at the application layer (mongoid). With Rails Console, querying the database ad hoc is even easier than with SQL, thanks to the mongoid API and Ruby's expressiveness. There are no transactions, but single-document writes are atomic. Performance is pretty good; the one hardship I run into is that the mongoid API is so high level that you don't see when you're firing way too many queries at the DB (but that's an ODM issue, not Mongo's; other ODMs might not do that).
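

Mongoid is Ruby, but the "schema lives in the application, not the database" pattern is easy to sketch in Python with pymongo too (assumes a local mongod; db, collection, and field names invented):

    from dataclasses import asdict, dataclass

    from pymongo import MongoClient

    @dataclass
    class User:
        # The schema is declared and enforced here, not in the database.
        email: str
        plan: str = "free"

        def __post_init__(self):
            if "@" not in self.email:
                raise ValueError("invalid email")

    users = MongoClient().app.users  # db "app", collection "users"
    users.insert_one(asdict(User(email="ada@example.com")))

    # Ad-hoc querying stays easy; Mongo itself sees only schemaless documents.
    print(users.find_one({"plan": "free"}))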


> schema, enforced declaratively at the application layer

Nobody has just one app forever. Any org that gets big enough is going to be uncertain about how many apps depend on a particular database, much less whether they’re all in the same language and bundle the same schema rules. Customer service builds weird support tools, accounting and compliance build weird reports …


Make sure each datastore has a clear owner, and access it via that owner's API rather than directly. If you really must share a datastore, use a schema registry. Sharing between multiple applications doesn't really work for SQL databases either (e.g. if you have multiple applications writing to your database in transactions then you're virtually guaranteed to get deadlocks).


> (e.g. if you have multiple applications writing to your database in transactions then you're virtually guaranteed to get deadlocks).

That assumption is only true if the DB is continually used by different applications. The reality is more likely a large number of applications used by a relatively small set of users, so most of the time there won't be any application writing, and it would be extremely rare for multiple applications to write at the same time.


Are rare deadlocks better than frequent ones? That sounds like a good way to not know you have problems.


This only works in theory.

In reality sometimes access via an API is way too slow or is lacking flexibility.

And I don't get this: "(e.g. if you have multiple applications writing to your database in transactions then you're virtually guaranteed to get deadlocks)"

Isn't this why transactions exist in the first place?


Depends on the level of isolation and how it's done. MVCC avoids deadlocks, but if you manually lock any row/table, you can still run into deadlocks. The DBMS's transaction manager will notice, though, and time out one of the transactions, failing it.

While MVCC avoids deadlocks, you'll still get a lot of failed transactions if they all operate on the same data at the same time, which you'll have to retry - but I'll take that over data inconsistency.
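

In practice the retry is a small loop. A sketch with psycopg2 under Postgres's serializable isolation (DSN and table name invented):

    import psycopg2
    from psycopg2 import errors

    conn = psycopg2.connect("dbname=app")
    conn.set_session(isolation_level="SERIALIZABLE")

    def transfer(src, dst, amount, retries=5):
        for _ in range(retries):
            try:
                with conn, conn.cursor() as cur:  # commits on success, rolls back on error
                    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                (amount, src))
                    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                (amount, dst))
                return
            except errors.SerializationFailure:
                continue  # a concurrent transaction conflicted: retry; data stays consistent
        raise RuntimeError("gave up after retries")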


Quite often you have several applications that look up each other's data, but mostly write to their own tables.

(But indeed, with uncoordinated development, deadlocks can occur. Still better than data corruption.)


I feel like the issue with NoSQL is that it's largely unnecessary when NewSQL databases exist.


This may continue to work, as long as performance is not your primary consideration (you have little data and few queries), you don't have to evolve the schema much, and no programming errors or ad-hoc updates introduce data that fail to follow the schema at the DB level.

So, it might be adequate for a small and carefully maintained project, likely like many alternatives.

Or do you use a high-load, distributed, replicated / sharded Mongo configuration? Then I'd gladly listen to more details.


What's your SQL db of choice where you can scale it up seamlessly beyond a single instance?

EDIT: to add ... and with HA/DR and where you keep all the good guarantees, queries etc.?


CockroachDB. I've heard the dev experience of spinning up nodes can be reduced to one or two docker commands, and it's Postgres-compatible while still giving you global transactions. Although you won't get the same transaction performance as something synced with GPS/clock hardware like Spanner, it's still a good self-hosted solution.

If you're interested, CRDB writes a lot about how they do their transactions on their blog, it's good to know what guarantees you actually get with your DB's transactions.


For what reason do you need to scale it beyond a single instance?

If all you need is to alleviate the high read load (many selects), nearly every SQL database (even SQLite!) has a free and supported way to create read replicas, usually out of the box.

If you need distributed transactional updates, MySQL / MariaDB has Galera (GPL), and Postgres has Citus (proprietary with a few AGPL parts).

And, of course, there is CockroachDB that prioritizes reliability over speed, but is truly distributed out of the box.


Right now I (well, we - my team) use HBase, and in a previous project I used Cassandra. These are very high scale systems where we need replicated writes (for durability and availability), we need horizontal scalability, and performance (write-heavy or mixed read/write workloads) trumps transactionality. There aren't a lot of writeups benchmarking NoSQL against SQL, but as far as I know it's not even close to the same ballpark: https://www.scylladb.com/2021/01/21/cockroachdb-vs-scylla-be...

With something like Cassandra you get read-after-write guarantees that I think you don't get with read replicas and SQL databases, but I'm not too familiar with the latest there. Also, once you start breaking up your SQL databases into shards, you lose all the SQL goodness across shards (like joins and other properties).

So no free lunches, basically. I'd love for there to be something fast, scalable, queryable, transactional, etc. Honestly I haven't looked in a while, but every open source SQL solution I've seen in the past that claimed to be durable/reliable/scalable doesn't actually deliver under failure modes the way NoSQL does, and pretty much loses either performance or other guarantees or both. Of course NoSQL doesn't solve some of these problems, but at least you know what you're getting into ;)


I agree! I've seen Cassandra in production, and learned why it was put there. With it, you still get some of the SQL goodness, but most guarantees are only within one shard.

With that, you let go of the global ordering, because global ordering / single-point serialization just becomes technically infeasible. At that point you usually don't exactly need them, though. You only need local order within a transaction.


Seeing most people go for the cloud, what percentage of projects will outgrow, say, Aurora? And beyond Aurora there are a ton of options like CockroachDB, TiDB, Yugabyte, etc.


If the data couldn't be fully sharded (i.e. independent DBs and replicas), I'd personally choose CockroachDB for this. I admittedly haven't had the pleasure of using Cockroach, but I've seen alternative approaches, and I know what I wouldn't do again…


No love left for POET? :)


> But you have a DB that everyone knows. 3rd party tools work. Developer on-boarding is easy.

Not reliably. All the different DBs have subtle misbehaviours and mismatches with the tools.

> Performance is actually good and consistent.

Lol no. Indexed joins and unindexed joins look exactly the same but have vastly different performance characteristics. And even if you have a query that currently performs well, you never know when or why the query planner will change its mind.

> Your data is also consistent and respects acid principles.

Only because you drop data on the floor if it's inconsistent, which is rarely a good enough solution in the real world. E.g. someone tries to add a customer but the city has been deleted in the meantime -> they get an error, and the system throws away everything they input about the new customer. You end up having to implement your own dead letter queues etc. and do all the work you would with a NoSQL datastore.

> With proper cashing SQL is as performant when done right and the profiling tools are better.

Caching makes your performance even more arbitrary and incomprehensible - it works great right until it doesn't. Profiling is great but it's a poor substitute for being able to reason about what the performance of your queries will be.

> Screw no-SQL. They are just Object Oriented DBs (which were a disaster in the 90s) in new clothes.

On the contrary - the problems with SQL databases are much the same as the problems with object databases.


If you're throwing away everything the customer has input, then you have a poorly written application.

Not inserting bad data is much better than trying to make sense of inconsistent data where you don't know what is true and what isn't.


Validating consumer input is good. Throwing it on the floor when it fails validation is bad. You want to separate "accept the consumer's input" from "validate it", which is something that's very hard to do in RDBMSes.


> Only because you drop data on the floor if it's inconsistent, which is rarely a good enough solution in the real world. E.g. someone tries to add a customer but the city has been deleted in the meantime -> they get an error, and the system throws away everything they input about the new customer. You end up having to implement your own dead letter queues etc. and do all the work you would with a NoSQL datastore.

What. Who's letting people delete random shit out of the DB? Why is the origin of this data not validating against the actual data from the DB before sending data?

"Hey we have to ensure that customers are selecting valid cities.

We have a list of cities in the DB. Let's ensure we cache the cities in the api server so that we're not hitting the DB for every page load for data that will very rarely, if ever, change.

Especially since we revoked Dave's DELETE permissions after that last time he deleted stuff from a production database."


> Why is the origin of this data not validating against the actual data from the DB before sending data?

It's on the client side, so it can't hold an open transaction with the server. It can validate against a previous version of the list of cities, but there's always the possibility of a delete happening in parallel.

(And if your answer is to use a delete flag rather than actually deleting a row, then you're undermining the referential integrity constraints that were supposed to be the whole point of using a relational database)


>It can validate against a previous version of the list of cities, but there's always the possibility of a delete happening in parallel.

If a "city" is a) required attribute of an entity and b) required to exist as an entity itself, then deleting a city would require deleting all other entities that refer to that city.

If city is a nullable attribute, then the validation issue you raised earlier is voided - an invalid "city" can just be replaced with null.

Assuming you're wanting to delete cities, then your client side shouldn't be firing off data and assuming success.

> It can validate against a previous version of the list of cities, but there's always the possibility of a delete happening in parallel.

Yep, caching is hard, but your argument that "RDBMSs aren't that great" is predicated on a very... convoluted system.


> If a "city" is a) required attribute of an entity and b) required to exist as an entity itself, then deleting a city would require deleting all other entities that refer to that city.

Right, so usually what people implement (and what the database enforces by default) is that you can't delete a city if there are any customers in that city. But they don't think about what happens when two people try to make changes in parallel, because the whole point of ACID is that each query runs in its own transaction.

> Yep, caching is hard, but your argument that "RDBMS aren't that great" is predicated on a very... ...convoluted system.

All the use cases for referential integrity - and transactionality - are like this though. You can't actually use them for anything, not with a web app where the client can't hold a transaction open.


> But they don't think about what happens when two people try to make changes in parallel, because the whole point of ACID is that each query runs in its own transaction.

You do know that the C in ACID stands for consistency? That is, in case a transaction tries to modify something that was already changed in a parallel transaction that has committed, it will fail and will be rolled back (given proper isolation levels).

So I really don’t understand your point. Also, you can hold a transaction open for a client..


> That is in case a transaction try to modify something that was already changed in a parallel transaction that was already committed, it will fail and will be rolled back (given proper transaction levels).

It will fail and be rolled back and you will lose the user's data that was written in it, unless you effectively build your own datastore on top of what the database offers.

> Also, you can hold a transaction open for a client..

For a client that's potentially gone offline for an indefinite period? Certainly not. Even if you're heartbeating that the client still has the form open somehow, that seems like a very suspect architecture.


> It will fail and be rolled back and you will lose the user's data that was written in it,

The point of relational integrity is ensuring all your data is in a valid state. If you don't want to lose a client's data because you deleted a city, there are multiple engineering approaches open to you.

1) Don't delete cities, just don't. The semantics of doing so don't make any sense in the first instance.

2) Further, deleting entities from an RDBMS is pretty damn final. What if a business analyst wanted to do some historical analysis? Well, that city is gone now, so it won't be there. This is where "soft delete" comes in (and no, soft delete doesn't destroy referential integrity - see the sketch after this list): you've marked the entity as no longer valid for use, but you still retain your data surrounding it.

3) Don't respond with a 2xx until the data store acknowledges successful insertion.

4) Use Debezium or similar to broadcast entity changes to a Kafka topic keyed on entity id. If you want to retain all changes (e.g., for event streaming), then use an appropriate retention strategy. If you want to retain only the latest state, use log compaction - it will retain the latest record with a given key.

This makes it easy for an app being spun up to obtain the entity state without touching the DB, and by continuing to subscribe to the topic, update its cache to match the source of truth.

Of course, you don't have to use Kafka, it's just one approach, but many other systems will allow you to do the same.
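

On point 2, the usual soft-delete shape is tiny - a sketch (schema invented): the row stays for history, foreign keys keep holding, and a flag plus a partial unique index keeps the live data consistent:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
    CREATE TABLE cities (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        deleted_at TEXT              -- NULL = live; set = soft-deleted
    );
    -- Names only need to be unique among live rows.
    CREATE UNIQUE INDEX live_city_name ON cities(name) WHERE deleted_at IS NULL;

    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        city_id INTEGER NOT NULL REFERENCES cities(id)  -- FK still holds
    );
    """)
    conn.execute("INSERT INTO cities (name) VALUES ('Metropolis')")
    conn.execute("INSERT INTO customers (city_id) VALUES (1)")

    # "Delete" the city: historical analysis keeps working, customers keep
    # a valid reference, and live queries just filter on deleted_at IS NULL.
    conn.execute("UPDATE cities SET deleted_at = datetime('now') WHERE id = 1")
    print(conn.execute("SELECT name FROM cities WHERE deleted_at IS NULL").fetchall())  # []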

Basically, your objections say nothing about the usefulness of a relational datastore, but rather, the need for careful engineering when building overly distributed systems.

And through all this, I'm wondering how you imagine a non-relational datastore solving these problems any better.


> The point of relational integrity is ensuring all your data is in a valid state.

Right, and doing that at the lowest level of your datastore is a fundamentally misguided approach, because it means you can't ever store data that's in an invalid state (almost by definition). So when data does become invalid, you're effectively forced to destroy it.

> And no, soft delete doesn't destroy referential integrity

Yes it does. It destroys the property you described above - that you ensure all your data is in a valid state.

> Don't respond with a 2xx until the data store acknowledges successful insertion.

So the user sees an error. That doesn't actually help much, because what can they do with that error?

> Use Debezium or similar to broadcast entity changes to a Kafka topic keyed on entity id. If you want to retain all changes (e.g., for event streaming), then use an appropriate retention strategy. If you want to retain only the latest state, use log compaction - it will retain the latest record with a given key.

Yes, now keep tugging on the thread of that thought. If you try to write code to reconstruct the state of your relational database based on those changes, you've got two copies of your logic that will inevitably get out of sync, and you'll have bugs that you only discover when you try to actually do it. (And if you only record differences between writes that actually made it into the relational database, you haven't solved the original problem of data being lost because writes are rejected). What you want to do is instead make those Kafka events the primary "source of truth" and construct the "current state of the world" based on that, i.e. event sourcing.

You don't, and can't, make use of transactions with that approach - you get (eventual) consistency because each log is ordered, which gives you the properties you want with fewer of the downsides (deadlocks), but writing and then reading the eventual results of that write is a fundamentally async process (which is good in the long term - it forces you to think about your dataflow and avoid loops - though it might involve more work upfront). And you don't really have relationality - if you need a relation in your live dataflow, you'll generally join at the event pipeline level and make the joined thing its own stream (much like a materialized view in SQL-land). You can build secondary indices etc., but those are explicitly a secondary thing layered on top of your primary datastore.


What does the form have to do with anything? That’s just part of an application, nothing special. What about storing the results in memory until you can make sure it is successfully stored? Like, this has nothing to do with databases, that’s just application architecture. Also, what language/framework have you been using because these things are routinely handled by any that worth anything.


> What does the form have to do with anything? That’s just part of an application, nothing special.

To be able to actually use database transactions you'd need them to extend across the form's lifetime. You can do that with a local application, but not with a webapp with the web request cycle.

> What about storing the results in memory until you can make sure it is successfully stored?

That has all the same problems as storing your data in memory and not using a datastore. In this scenario the data can never be stored because it now violates referential integrity.

> Also, what language/framework have you been using because these things are routinely handled by any that worth anything.

Many of the big-name Python or JVM-based ones, but it's a fundamental problem that can't be solved at that level. Which language/framework are you claiming handles this well?


> All the use cases for referential integrity - and transactionality - are like this though. You can't actually use them for anything, not with a web app where the client can't hold a transaction open.

You should probably tell the millions of companies actually using them for things that they're wrong. I'm sure they'll feel very silly.


The actual businesses generally don't. It's funny how RDBMS fans always say "businesses x, y and z were built on MySQL" and forget to mention that they were actually using MyISAM-based versions of MySQL that had no ACID and barely any referential integrity constraints. Most of the "mongodb is web scale" crowd is coming from the same ideology as the people who were hating on MySQL 10 years ago.


I feel like I'm missing something in this article. Several of the points just don't make any sense to me.

* Their point about how you can just rewrite relational queries manually and not lose any performance does not seem true the moment that you need to join tables on a condition.

* Their section on reliable replications can be resolved by keeping a log of the changes made to the database rather than a log of the queries. I don't understand how their NoSql model of "Download the latest doc during a change" is any better. It basically becomes "last writer wins" in a distributed system.

* The downgrading to NoSQL argument makes little sense to me as well. If your frontend is SQLite then you can have Postgres or MariaDB in your backend. You need to account for having different queries on the front and the back, but the backend is presumably also operating under very different constraints than the frontend.

* I'd actually make an argument that NoSql databases are a type of relational database heavily optimized towards not needing to do joins. The corollary to this is that over time you may find that, oops, you do want to do joins, and now you have to pay a tax on not being set up to do so in the first place.

* Also, is it just me or is the Hybrid Logical Clock they mention a type of Lamport Clock where the linked article doesn't ever state that?


> Their point about how you can just rewrite relational queries manually and not lose any performance does not seem true the moment that you need to join tables on a condition.

You can do the same thing a database would do - filter the first results on that condition before firing off the second query, or do the join "backwards" if you think that's going to be better / there's an index available for that.
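

For illustration, a sketch of that manual join (schema invented) - filter the first result set, then fire the second query against it:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 1), (2, 0);
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0);
    """)

    # "Join" users to orders on a condition without a JOIN clause.
    active_ids = [r[0] for r in conn.execute("SELECT id FROM users WHERE active = 1")]
    placeholders = ",".join("?" * len(active_ids))
    orders = conn.execute(
        f"SELECT user_id, total FROM orders WHERE user_id IN ({placeholders})",
        active_ids,
    ).fetchall()
    print(orders)  # [(1, 9.5)] - only orders belonging to active users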

> Their section on reliable replications can be resolved by keeping a log of the changes made to the database rather than a log of the queries.

You'd need to define a representation of that log, which ends up being equivalent to writing a NoSQL datastore.

> I don't understand how their NoSql model of "Download the latest doc during a change" is any better. It basically becomes "last writer wins" in a distributed system.

You at least get consistency. And NoSQL gives you the option of building something better like e.g. Riak does.

> The downgrading to NoSql argument makes little sense to me as well. If your frontend is sqlite then you can have Postgres, or MariaDb in your backend. You need to account for having different queries on the front and the back, but the backend is presumably also operating under very different constraints than the frontend.

This is kind of the same as the log replication problem - you want the protocol for what's replicated from frontend to backend to be something simple that you could implement on top of approximately any backend datastore. "SQL queries" aren't that, whereas a simple K/V store protocol could be (e.g. MySQL actually implements the memcached query protocol, or did at one point - that would be very hard to do the other way around).

> I'd actually make an argument that NoSql databases are a type of relational database heavily optimized towards not needing to do joins

Um, WTF? Can you flesh this out at all?


> Um, WTF? Can you flesh this out at all?

Ha, sure.

Let's say that you have a very-basic NoSql schema for a chat app:

chat_app/person/<person id>/msgs/<chat documents>

so for each <person id> you stick chat messages in the <chat documents> separated by date or something. Conversations with different people end up in different <person id> paths. Chats have metadata like timestamps in addition to the message content

You can also represent this as a SQL table with something like:

Person_Id (uint) | Timestamp (timestamp) | Message (string)

Now the classic argument for NoSql is that NoSql is the best representation for this since you don't really care about doing complicated queries over the chats. You just want to be able to index in and get a persons chats. Maybe sort by the most recent chats so the <chat documents> have the timestamp in the key. That's what I mean.

Where I get frustrated with NoSql is that you inevitably end up needing to redo a lot of SQL logic anyways. A basic example is mapping the person_id -> visible name. The NoSql way is to have a meta file under <person id>, like chat_app/person/<person id>/<meta file> where you would have the user name under there. Then you can look up easily what to display when showing chats for <person id>.

But then you add group chats. Now you need to also have a group chat section and you need to go back to the person_id path to resolve the visible names. Having a relational table with:

Person Id | Visible Name

seems like a more elegant solution to me. And once you need to start gathering stats on the data, relational tables make it easier to get information like how often you are talking to various people or the lengths of those conversations. Most of the functionality for statistics gathering and other use cases can be replicated in NoSQL, but it involves extra effort that SQL just gives you.
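

For concreteness, the relational version of that chat schema might look like this (a sketch, names invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE people (
        person_id    INTEGER PRIMARY KEY,
        visible_name TEXT NOT NULL
    );
    CREATE TABLE messages (
        person_id INTEGER NOT NULL REFERENCES people(person_id),
        ts        TEXT NOT NULL,
        message   TEXT NOT NULL
    );
    """)

    # Name resolution is one join away - no per-person meta files - and
    # group chats are just more rows referencing people.
    recent = conn.execute("""
        SELECT p.visible_name, m.ts, m.message
        FROM messages m JOIN people p USING (person_id)
        ORDER BY m.ts DESC
    """).fetchall()

    # Stats come "for free": message counts per correspondent.
    stats = conn.execute("""
        SELECT p.visible_name, COUNT(*)
        FROM messages m JOIN people p USING (person_id)
        GROUP BY p.person_id
    """).fetchall()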


Sometimes data is relational, sure. But even in SQL-land you generally end up wanting to denormalize it for performance. I guess "optimized for not doing joins" is sort of true, but even on an SQL database that's something you optimize for in your live dataflow. Meanwhile not being able to have collection columns or sum types is a much bigger and more common data modelling problem in my experience - yes PostgreSQL has JSON columns but they're non-standard, not supported by all the drivers, not compatible with your in-memory test database, and ultimately still fairly clunky.

> Once you need to start gathering stats on the data as well, relational tables become more easy to gather information on how often you are talking to various people or lengths of conversations with people. Most of the functionality for statistical gathering and other use cases can be replicated in NoSql but it involves extra effort that SQL just gives you.

Now this I do sort of agree with - the one thing SQL databases are good at is semi-ad-hoc statistics/reporting/aggregation. Several systems I've worked on had an ETL pipeline that ended with dumping everything in an SQL or SQL-like datastore for the data folks to play around with. But you can't do that kind of querying in your primary live datastore once you have any sort of scale - at best you'll slow down your main system, quite possibly you'll deadlock or worse. So even when I've worked on systems that were all-SQL, we ended up with much the same architecture of an ETL pipeline dumping data out of the primary datastore into a separate reporting datastore.


> > Their section on reliable replications can be resolved by keeping a log of the changes made to the database rather than a log of the queries.

> You'd need to define a representation of that log, which ends up being equivalent to writing a NoSQL datastore.

I think this is covered by relational DBs' redo/rollback/write-ahead log, whatever you want to call it.


Right, but they generally keep it as a hidden implementation detail, and when you do get access to it it's not in any standardised way. For a lot of use cases you're better off unbundling it and newer datastores tend to be better at that.


Hybrid logical clocks are distinct from Lamport clocks in that they include a notion of local timestamp while also preserving partial order. They are well-established and used widely outside RxDB - see e.g. https://sergeiturukin.com/2017/06/26/hybrid-logical-clocks.h...
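

For reference, the core of an HLC is only a few lines. A sketch following the Kulkarni et al. algorithm, with the physical component in wall-clock milliseconds:

    import time

    class HLC:
        """Hybrid logical clock: timestamps are (physical ms, logical counter)."""

        def __init__(self):
            self.l = 0  # highest physical time seen so far
            self.c = 0  # logical counter to break ties

        def _wall(self):
            return int(time.time() * 1000)

        def send(self):
            # Local event or message send.
            pt = self._wall()
            if pt > self.l:
                self.l, self.c = pt, 0
            else:
                self.c += 1  # wall clock stalled or behind: order logically
            return (self.l, self.c)

        def recv(self, ml, mc):
            # Merge the timestamp (ml, mc) carried by an incoming message.
            pt = self._wall()
            l_old = self.l
            self.l = max(l_old, ml, pt)
            if self.l == l_old == ml:
                self.c = max(self.c, mc) + 1
            elif self.l == l_old:
                self.c += 1
            elif self.l == ml:
                self.c = mc + 1
            else:
                self.c = 0
            return (self.l, self.c)

Unlike a plain Lamport clock, the physical component keeps timestamps close to real time, which is what makes them usable for things like ordering writes on replicated documents.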


> Their section on reliable replications can be resolved by keeping a log of the changes made to the database rather than a log of the queries

I do love Debezium for easily streaming changelogs to Kafka.


NoSQL is just a poor term. It lumps together a number of radically different approaches. Imagine a term like NoCar that would lump together airplanes, bicycles, boats, trains, and scooters, just because they are means of transportation which are not a car.

Things like Redis, Kafka, Consul, FoundationDB, RRDTool, git, S3, and plain files are all NoSQL databases of sorts. They all are useful in certain areas, all have very different features and guarantees, and each would be a poor substitute for each other or for an SQL database. (Likely even MongoDB can be useful in some areas, even though I have a hard time imagining that.)

I wish the whole "NoSQL" moniker would go away, replaced by a few terms that make more sense.

That said, the original article is a good one, laying out the upsides and downsides of a document database, and why it makes sense as a local, per-app database.


> radically different approaches

Exactly. Relational databases are fantastic general tools, but various use cases can make more specialized data stores the best choice for a particular job.

Document stores, key-value stores, column-oriented databases, and graph databases can all be more suitable for a given task than relational databases.


> Document stores, key-value stores, column-oriented databases, graph databases can all be more suitable for a given tasks than relational databases.

Can related features be add-ons to RDBMSs rather than reasons for leaving the RDBMS world for good? Any non-trivial system will need at least some of what RDBMSs offer. We should try not to throw out the baby with the bathwater.

I'd like to explore specific use-cases for NoSql to see what features they need that RDBMS currently lack and if it's impossible to practically add those features to RDBMS.

Dynamism of "structure" is about the only thing I can think of right now, and there are possible remedies, per "Dynamic Relational" mentioned nearby.


At some point when designing a data store, decisions must be made with regards to memory layouts, integrity, transaction guarantees, ease of replication, etc. These choices typically involve reliability vs performance tradeoffs for various use cases.

I mean, you _can_ store graph data in a relational database, but it's not typically easy to insert or query, there are serious performance penalties, etc. Purpose-built data structures will always out-perform generic ones.


I recently learned DynamoDB (a non-relational database on AWS). It was fairly easy to learn - basically a one-day read through the documentation. It made me grateful that I first learned SQL databases and spent years working with Oracle and Postgres. Now I can easily choose the best tool for each job. But if I had first learned a non-relational database, I could see myself looking at the daunting task of learning about database schema design, normalization, SQL, PL/SQL, triggers, indices, keys, constraints, etc. and throwing my hands up at the sheer weight of it all. It certainly isn't a one-day learn like non-relational databases. I'd probably try to force non-relational databases to fit every use case, even where they aren't the best fit.


> A new client downloads all current documents and each time a document changes, that document is downloaded again

Those who do not remember Lotus Notes are doomed to reinvent it.


Lotus Notes, if you look past it having one of the ugliest UIs ever made, was a pretty neat tool. When I look at a tool like Notion, I see it as what Notes should have become.

I worked in a Notes/Domino shop pre-2000, and after some "geniuses" in the company read a Gartner report that SaaS was going to be a thing within a couple of years, my company blew a lot of resources to hack together a multi-tenant project collaboration app using Domino that was delivered to web browsers.

Of course, it didn't go anywhere, because it took many more years before SaaS eventually became mainstream. It's funny looking back at what we made around '98-'99, because JS in the browser was far less mature than it is today.


Hearing this repeated, and the comments, makes me start to feel a little bit old. Haven't we re-hashed this out, thousands of times, at this point?

There are use-cases for both! There are certainly mis-uses of both, too!

DB admins are a real thing; I haven't met many, but the few I have are worth their weight in gold. Truly great programmers - I feel the same about them.

They are not mutually exclusive -- solve your domain problem, iterate, and improve from there.... It's that "simple".


The author ignores a huge amount of research done in the last 10 years on query evaluation.

A series of filter operations and successive sub-selects can't reach the performance a worst-case optimal multiway join can reach (or even the performance of a series of two-way joins that are smarter than linear scans, e.g. a hash join).

Similarly, there is no reason why incremental view maintenance should be possible for NoSQL, but not for SQL. Materialize has shown that incremental view maintenance for PostgreSQL is very possible and very fast.

This article also seems to very narrowly limit NoSQL to document stores, which is extremely narrow-minded.

All in all it does NoSQL a huge disservice.


Could you elaborate on your second paragraph?


TL;DR: Your intermediate result sets can be significantly bigger than your output set. By performing a depth-first search through the variable-assignment space, with constraint propagation, instead of combining whole relations pairwise, you can achieve a runtime that is much better than what classical pairwise-joining systems can achieve. Since the proposed solution for joins in RxDB is an especially naive kind of pairwise join, it will be especially bad.

https://cs.stanford.edu/people/chrismre/papers/paper49.Ngo.p...
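

To make the idea concrete, here's a toy generic join for the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c): bind one variable at a time, intersecting the candidate sets from every relation that mentions it, so intermediate results never blow up the way a pairwise R-join-S can:

    from collections import defaultdict

    R = {(0, 1), (0, 2), (1, 2)}   # R(a, b)
    S = {(1, 2), (2, 0), (2, 3)}   # S(b, c)
    T = {(0, 2), (1, 0)}           # T(a, c)

    def index(rel):
        # Index a binary relation by its first attribute.
        idx = defaultdict(set)
        for x, y in rel:
            idx[x].add(y)
        return idx

    Ri, Si, Ti = index(R), index(S), index(T)

    triangles = []
    for a in Ri.keys() & Ti.keys():    # a must appear in both R and T
        for b in Ri[a] & Si.keys():    # b must extend (a, .) in R and appear in S
            for c in Si[b] & Ti[a]:    # c must close the triangle in S and T
                triangles.append((a, b, c))
    print(triangles)  # e.g. [(0, 1, 2), (1, 2, 0)]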


I thought NoSQL would make my life easier, but it doesn't. Using an RDBMS with an upfront schema really is simpler. Yeah, changes can be awkward, but it's worth it.


The schema is basically just static typing for your data.

So the same pros and cons as for static vs dynamic typing in programming languages apply.

(As an interesting aside, I used to work with a system that had relations as in-memory data structures. They were very pleasant to work with compared to e.g. Python dicts, because no single key was privileged, the way one is with a dict.)
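

A sketch of what that looks like in Python: a relation is just a set of tuples, queryable on any attribute alike, whereas a dict forces you to privilege one key up front:

    from typing import NamedTuple

    class Employee(NamedTuple):
        name: str
        dept: str
        city: str

    staff = {
        Employee("ada", "eng", "london"),
        Employee("grace", "eng", "nyc"),
        Employee("edsger", "research", "austin"),
    }

    # No privileged key: select on any attribute, or several at once.
    print({e.name for e in staff if e.dept == "eng"})
    print({e.dept for e in staff if e.city == "nyc"})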


I think the author didn't cover a lot of fundamental research behind NoSQL. If you are reading this comment, and want to dive into SQL/NoSQL DB fundamentals, I highly recommend checking CMU DB (Andy Pavlo) lectures [1] [2].

[1]: https://db.cs.cmu.edu/seminar2020/

[2]: https://www.youtube.com/playlist?list=PLSE8ODhjZXjagqlf1NxuB...


The slightly inflammatory title makes more sense in the context of their database, which sits on top of PouchDB, with adapters for IndexedDB in the browser and a variety of stores on the server side.

Based on the series of blog posts / documentation opinion pieces that have been posted so far, I'm quite interested in playing around with this despite being mostly in the relational SQL camp. Everything I've read is thoughtful, well reasoned, and rather practical and the author is exploring a rather interesting problem space.

I'd love to see a mashup of RxDB[1] and absurd-sql[2] that brings a distributed SQL datastore to the browser.

1) https://rxdb.info/

2) https://github.com/jlongster/absurd-sql


DynamoDB is pretty cool though. You need to know a few limitations and design your data in a certain way, but in turn you get literally unlimited scaling and will never have to worry about overnight success bringing your services down.

No wonder AWS has moved most of its DB usage over to DynamoDB.

For me it's Postgres -> DynamoDB when working on an MVP and then maturing; or, if the access patterns are well known to begin with, DynamoDB from the start.

One particular thing I love about DynamoDB is the 1-click global tables. I've yet to see anything similarly easy in the SQL world for going global with a database. (Like one AZ in the US and one AZ in the EU, so that every customer gets low latency.)

Also, most of the time, if Dynamo is used correctly, it costs way less than an RDS instance running Postgres.
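

"Design your data in a certain way" usually means key-first, single-table layouts. A sketch with boto3 (table name and key shapes invented; assumes the table already exists with pk/sk string keys):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("app")  # hypothetical table

    # Access patterns are baked into the keys up front:
    # everything for one customer shares a partition key.
    table.put_item(Item={"pk": "CUST#42", "sk": "PROFILE", "name": "Ada"})
    table.put_item(Item={"pk": "CUST#42", "sk": "ORDER#2021-10-06", "total": 99})

    # One query fetches the whole customer "aggregate" - no joins.
    items = table.query(KeyConditionExpression=Key("pk").eq("CUST#42"))["Items"]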


> never have to worry about overnight success bringing your services down.

But the chances of becoming a viral sensation are very small compared to the probability of the longer-term maintenance headaches you'll likely have under NoSQL. NoSQL is thus similar to meteor insurance. I suspect egos are getting in the way of realistic estimates.


DynamoDB has way less maintenance compared to MySQL/Postgres running even on a managed service.


This article regards write-intensive usage scenarios (OLTP) rather than analytics-focused usage scenarios (OLAP).


Use the right tool for the job. SQL databases are not always the right tool. NoSQL databases are not always the right tool. If you think an RDBMS is the best choice for every problem, then maybe you should stop and ask yourself if it's just the only tool you're comfortable with. There is a reason that the number of NoSQL options continues to expand - if nobody was using them successfully then they probably wouldn't continue to grow.


SQL = structured query language.

NoSQL typically means non-traditional non-relational database. How many of those actually have truly unstructured data where SQL would genuinely be useless?

It doesn't always mean that the vendor of a NoSQL database shouldn't or couldn't implement SQL on top of it.


Been using MongoDB for 7 years now. Best thing I've ever done. A NoSQL DB like MongoDB cuts down on your joins a lot! Also, taking business logic out of the DB and putting it all in the application layer makes things way easier with versioning and sharing code across projects. I love that the Mongo database is just used as a black box. It saves data and it queries it, nothing more. This part is rarely talked about. Installation, and having multiple versions running side by side, is so easy. You can't beat just copying a folder of exes, running it, and it works. I have systems that have been running for years and have never lost data. I don't use an RDB unless the client demands it.


There's a reason pretty much no real tech company uses NoSQL except in very clear, explicit cases. Every single "problem" you mentioned is not a real problem to begin with, or also exists on both sides and you didn't even realize you were dealing with it. Like migrations.


> There’s a reason pretty much any real tech company doesn’t use nosql except in very clear explicit cases.

That is such a bold claim that is obviously not true.


Can you give a single example of a major tech company using nosql as its primary db for any application?


MongoDB has 26,500 customers worldwide (and we are just one NoSQL vendor).

These customers include:

Bosch : https://www.mongodb.com/customers/bosch

HSBC : https://diginomica.com/hsbc-moves-65-relational-databases-on...

SEGA Hardlight : https://www.mongodb.com/blog/post/sega-hardlight-migrates-to...

HMRC : https://www.mongodb.com/blog/post/mongodb-microservices-help...

DWP : https://www.mongodb.com/customers/department-for-work-and-pe...

Liberty Mutual : https://www.mongodb.com/blog/post/liberty-mutual-iac-mongodb...

MetLife : https://gigaom.com/2013/05/07/with-300m-earmarked-for-tech-i...

There is a more complete list here : https://www.mongodb.com/who-uses-mongodb

That list is limited to just the customers that are willing to be public references.

Every mature NoSQL vendor has a similar list.


Pretty much all of AWS and Amazon runs on DynamoDB.

Google has Firestore as one of their approved databases for internal use.

Those are ones I have first hand experience with.

I know from friends that Microsoft runs things on Cosmos.

Netflix is pretty invested in Cassandra.


Stripe.


A key aspect that RDBMS lack compared to NoSql is dynamism. Dynamic Relational is a draft RDBMS idea that intends to fill that niche. Unlike the NoSql movement, it tries to keep as many "traditional" RDBMS idioms as possible when adding dynamicness. It also allows one to "lock down" the schema incrementally as projects mature. https://stackoverflow.com/questions/66385/dynamic-database-s...


"Everything can be downgraded to NoSQL"

I am just looking at our ERP system, which has around 580k tables, 10M table columns, and around 5.6M programs accessing these tables.

For all practical purposes, I really wonder how this should work using unstructured data.


Are these actual numbers? If so, you've got me curious! How does it work? Do you really have billions of lines of code? How many of those 580k tables are auto generated?


This is a standard SAP system. I just checked the numbers before I put them down here. SAP has been around since the late '80s, its code is open, and very little of it has been officially retired, so the amount of code and number of tables have been monotonically growing for 4 decades. Lots of redundant programs as well. So the code base is huge and the database is of course not completely normalized.

SAP's programming environment comes with a built-in editor and a relational database; the metadata of every table is stored in the database, and programs are not stored on the file system but in the database as well. (Which makes tools like GitHub superfluous - when you edit code, it takes a lock in the database.)

So, conveniently, in SAP there is a database table which holds all database tables, and another table for all fields. Likewise there is a table for all programs, and SAP stores all program lines in another database table. So you just need to count the number of entries in these tables.

I am not aware of any generated tables, but maybe there are some - maybe 5%. I'd estimate that maybe half of the programs are actually generated code. Though "program" is probably not accurate: some of the programs are modules or function pools which are invoked by other programs.


I did some further research today because I wondered myself. I think the number of generated programs is higher than my estimate. I'd now put the number of non-generated programs in the ballpark of 500k and the number of transparent database tables around 300k. The rest are views defined on the database, not transparent tables.

The numbers are smaller than initially given, but still too big to be handled with unstructured databases.


> Yes, there are SQL databases out there that run on the client side or have replication, but not both.

Yes, there are. https://zumero.com for instance.


NoSQL == NoCare about data integrity and consistency concerns.

NoSQL allows you to start coding fast (no need to learn any awkward 'legacy' concepts).

More importantly it allows you to kick the can down the road for all those problems old people told you about, but which you don't really believe exist.

If your project is a one off, NoSQL could be a good choice.

But if the data has any value, or especially if it is expected to outlive the original application it was coupled to, SQL is the way to go.


My dream is a database where I just add nodes to physical machines by running a docker container and I query using whatever mechanism.

From there it would just auto heal, scale and do all the ops by itself. CouchDB is sort of like this in theory (definitely not in practice).


The commercial version of this is Snowflake, except that it's completely seamless, scales invisibly, and it's fast as nuts. At least 100-1000x faster on common workloads compared to a mid-2000s IBM SQL DB solution for the same data at a really large enterprise.


Couchbase does that


The article didn't answer the question for me of how to query for the absence of a property, without getting an entire document collection first.

I was really hoping there was a secret to search against null or undefined.

Is there a NoSQL solution that does allow for such queries?


With CouchDB you can do it. You create a view that filters on the absence or presence of that field. I'm not sure if that answers your question, as I don't know how other NoSQL databases handle indexing.


I'll investigate the constraints of CouchDB, thanks!


If you want to include Postgres jsonb columns in that, you can build a partial index whose predicate is a NOT of the jsonb key-exists operator.
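

Concretely, something like this sketch (psycopg2 against a hypothetical docs(id, data jsonb) table, using Postgres's ? key-exists operator):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Partial index covering only the documents that LACK the property,
        # so the absence query doesn't walk the whole collection.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS docs_missing_email
            ON docs (id)
            WHERE NOT (data ? 'email')
        """)
        cur.execute("SELECT id FROM docs WHERE NOT (data ? 'email')")
        print(cur.fetchall())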


An argument I heard the other day: lots of people are realizing the benefits of typescript. Those same people will eventually realize a table schema will make their lives similarly easier.


the argument is correct, but alas: "eventually" can take a long time. But types and NoSQL can be friends too.


Maybe it is because I've only ever worked on web applications so I can't imagine the use cases, but when would you ever want your database to be replicated to a client's computer?


One of the use cases for RxDb is "local first" data. Rather than spend time doing roundtrips to a remote server for each piece of data, store the database locally on the client and keep the local/remote in sync.


IndexedDB for the client, web workers to check that things are in sync, and events for when things aren't... best of both worlds.


Testing and prototyping.


Headline is the question, the answer is "it doesn't"



