Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Show HN: DoltgreSQL – Version-Controlled DB, Like Git and PostgreSQL had a baby (github.com/dolthub)
162 points by Hydrocharged on Nov 2, 2023 | hide | past | favorite | 63 comments
From the company behind Dolt—the world's first fully versioned database—comes DoltgreSQL, which implements PostgreSQL's variant of SQL.

DoltgreSQL is at a very early stage, and we have quite a lot of work left to do, but we'd love to hear all thoughts and opinions! You can read more in the announcement blog post: https://www.dolthub.com/blog/2023-11-01-announcing-doltgresq...



> In 2019, when we were conceiving of Dolt, MySQL was the most popular SQL-flavor. Over the past 5 years, the tide has shifted more towards Postgres, especially among young companies, Dolt's target market.

Really? I was under the impression that PostgreSQL had already won by then.


Among one-off toy projects shared on HN, probably. In the industry, MySQL is still the dominant system in my experience. Just to check though, Statista has it as dominant [1] as well as 2 other sources that appear to be geared towards sales & marketing research [2] [3].

[1] https://www.statista.com/statistics/809750/worldwide-popular...

[2] https://6sense.com/tech/relational-databases/mysql-market-sh...

[3] https://www.datanyze.com/market-share/databases--272


MySQL is still dominant in the PHP world, and even if HN will laugh at you for using PHP it's still really popular


> Statista has it as dominant [1] as well as 2 other sources

I'm pretty sure that's the Wordpress effect. How many Wordpress users would buy a database engine that is 99% compatible with MySQL, because it supports branching?


The engine is free and open-source. You can buy support, but these are free products.

https://github.com/dolthub/dolt

In fact, we've got a blog post about Dolt and Wordpress.

https://www.dolthub.com/blog/2023-08-04-wordpress-on-dolt/


> The engine is free and open-source. You can buy support, but these are free products.

Sure, but my question was about the feasibility of selling support to Wordpress users vs PostgreSQL users. The latter usually care more about advanced features such as the ones Dolt offers.

Don't get me wrong; I think Dolt is a great product. I was simply replying to the 2023 stats posted by k1ns.


Yeah 2019 was definitely Postgres world at least by 5 years.


Back then, when we were surveying the landscape, PostgreSQL was definitely in widespread use, but we were seeing MySQL being used in more professional/non-hobby spaces. Nowadays that's not necessarily the case, and we are seeing shops that were once MySQL-only now adopt PostgreSQL for some of their newer projects. With Dolt supporting MySQL and DoltgreSQL supporting PostgreSQL, we're hoping to appeal to both audiences.


Let's put it this way - someone who would use Dolt would have been using Postgres in 2019.


The adoption of Dolt has not pointed to this being the case. Many of our (paying) customers have not had PostgreSQL setups, and only around last year have we started to run into potential customers who truly needed a PostgreSQL variant of Dolt. Dolt and DoltgreSQL are open source and will forever be as we believe in true open source, however as a business, we prioritize the features that our customers need.

Either way, with DoltgreSQL now in development, we will now have solutions for both sides! (Other popular databases with their own syntax, such as Oracle and Microsoft, are not planned for the future at this time)


> The adoption of Dolt has not pointed to this being the case.

That's incredibly self-fulfilling. Counter status quo signals self-select out before even talking with status quo vendors.


This reply suggests you are able to see point-in-time among your self-selected subset of the world, but not able to see trend-through-time across the overall.

It's as if driving by rear view mirror based on noticing cars around you, rather than having a traffic helicopter high overhead looking across all the roads and ahead.

Inability to see difference in trending versus point in time is how "incumbents" end up having their lunch eaten. Looking at point in time instead of trend, the challenger isn't a threat until they've crossed over, and by then, well, you've lost 5 years. Of course only looking ahead, you might never get anywhere most people want to go today, and you have to get paid. So the trick is always considering both!

All that said, strong pivot now and we'll be checking your take out.


probably has more to do with people counting the number of mature projects vs greenfield development

by sheer numbers, MySQL is probably still dominant but if projects are starting today it's likely postgresql


Why a fork? Couldn't you just add a separate front-end? Will the underlying storage format stay the same so one can flip over or do you want to get closer to PG semantics?


This isn't a fork of PostgreSQL, it's a completely bespoke database solution. Its only tie to PostgreSQL is that we've chosen to appear as a PostgreSQL server to clients. If a user didn't use any versioning features, then the goal is that they should be unable to tell that they're not on an actual PostgreSQL server.

The versioning features are an important distinction though. Dolt (production ready, MySQL protocol) and DoltgreSQL (pre-alpha, PostgreSQL protocol) are built specifically to address the lack of versioning support in databases, and gaining these versioning features is as easy as swapping out the database you are using for Dolt and DoltgreSQL (once it's finished). MySQL and PostgreSQL are written using C/C++, while Dolt and DoltgreSQL are using Go, so there is no shared code. The storage format is implemented using prolly trees (https://docs.dolthub.com/architecture/storage-engine/prolly-...), which are based on merkle trees (used by Git and Bitcoin), so there is no overlap with any existing database solutions.


Hmm, when choosing database solution for a new project I'm not selecting sql dialect (like postgresql sql), but stability and ecosystem. So having this as an extension to postgresql, and possibility to combine it with other extensions it would be great, but reimplementation is a no go here.

And I can not use it for existing projects, because again extensions, and I surely don't want to findout how your implementation differs from the mainstream postgres...


Knowing that extensions are very important to you is great feedback for us. As we hear more about what users' requirements are, it helps us better plan for the future.

Regarding any differences from mainstream Postgres, you can look to how we've handled Dolt, which is production-ready. It targets MySQL, just as DoltgreSQL targets PostgreSQL, and it recently achieved 99.99% correctness according to a set of roughly 6 million tests (https://www.dolthub.com/blog/2023-10-11-four-9s-correctness/). This test is not a definitive stance that we are exactly 99.99% the same as MySQL, but it's a good general guide to how we approach our compatibility, and how serious we are in that regard.

Sadly though, the full versioning capabilities would not work as an extension to Postgres. We looked into it before we settled on our current approach. I talk a bit about it in the blog post as well. To truly allow versioning in the same capacity that Git does for source code, it required us to either fork Postgres and spend years reimplementing all of the work that we've done in Dolt just to get to where we are today, or choose the path that gets something out quickly, and allows us to have the very conversation that we're having right now.

That was actually the very reason for deciding to host an announcement that we're working on it. Many people have said that they'd like Dolt but for Postgres, however they've not said whether they need the Postgres binary specifically, the ecosystem, the syntax/wire protocol, etc. This announcement gives us the opportunity to receive that feedback.


Where are the 0.01% differences? When I'm trying to commit transactions? During select? Just not supporting some esoteric stored function syntax?

And most importantly, how does Dolt compare under heavy load, at the limits of server memory or bandwidth or CPU or disk thoroughput? You can assume an SSD and a multicore processor for purposes of answering.

Thank you very much. You are competing in a field where trust is extremely difficult to acquire - and the consequences to a lead dev for choosing Dolt[greSQL] could end his career. Nobody ever got fired for choosing the incumbent, as variations on the saying go.


The sqllogictests are better used as a judgement of how accurate our results are to MySQL's results given a very large range of statements (around 6 million tests). They are not, however, indicative of the entire feature set. That's to say, there are a few things that are not yet supported in Dolt, however they are the more esoteric features. Things like transactions, etc. are what I would consider "core" SQL, and we are very keen on making sure those work exactly as you'd expect

There are two key points regarding performance that I'd like to mention, and that's that Dolt uses higher-than-expected disk, and we also use more memory than MySQL by comparison. The disk usage is due to our optimization of speed at the sacrifice of disk accretion via temporary storage, which we've decided is a fair trade off considering disk is very cheap (and we're working on making this tunable so users can decide on speed vs disk efficiency). Memory usage is a bit more complex, and I'm not the employee to comment too much about it, but both of these issues are being worked on to reduce their impact. With that in mind, our performance is comparable to MySQL as long as the machine limits are not being reached, however I'd expect us to be a bit slower once those limits are reached, simply due to the extra complexity that we're managing.

Lastly, trust is something that can only be built over time. In 10 years, I'm sure that there will be no doubt of our stability, and at that time I can see Dolt and DoltgreSQL becoming the de-facto databases used for relational storage. Of course, I may be a bit biased :)

Trust is a


It sounds like question is: why fork Dolt to make DoltgreSQL?


Dolt was built with MySQL in mind, and we're creating DoltgreSQL with PostgreSQL in mind. We've gotten interest in a "Dolt for Postgres", and so we're finally starting development on that exact thing.


Is there any compatibility between the underlying storage systems of Dolt and DoltgreSQL? Yes, the interface is slightly different, but is the storage (partially) compatible? Why aren't the two SQL dialects different interfaces to the same underlying storage?

The reason that I am asking is because I have a hard time trusting a newcomer to the database competition. And for purposes off discussion, DoltgreSQL will be a newcomer for the first ten years of its existence. Sharing the underlying storage model with Dolt (still a newcomer, sorry) would greatly increase my confidence in it.


You can think of Dolt as having 4 layers to it. From the top-most interface to the bottom:

Vitess (Handles Wire Protocol)

go-mysql-server (GMS, Implementation of MySQL featureset)

Dolt GMS interface

Storage Engine

DoltgreSQL, right now, shares the storage engine, interface, and a portion of GMS. This is primarily to get something up and running so that we're able to further iterate on a "working" product (i.e. something we can set up testing for). I envision that GMS will split into three, with a MySQL & PostgreSQL core interacting with a base SQL core. Dolt's GMS interface would then adapt to the base SQL core interface. The storage engine is more-or-less set in stone.


I've been waiting eagerly for this.

Do you have a clear position on which PostgreSQL features not to support? I suppose there are more than just some things that won't make the cut because of the architectural decisions.

While I unnderstand the decision, I'm not sure it's the best way to go about it. If you only emulate a subset of PostgreSQL's syntax and features, few people will be compelled to switch because they might be afraid. For greenfield projects, most people would probably choose the MySQL syntax since it's the default.

I don't think this is about the necessity of running the PostgreSQL binary itself (although your approach already removes extensions which for many people is a downer). It's just that you can't trust an emulated system to be 100% equal in behavior (and people rely on implicit behavior of a system all the time, unfortunately) and that might be already enough for a lot of people to not use it.

Have you guys already encountered some things in the PostgreSQL engine that just behave a bit differently from Dolt's engine? If so, what was your approach to mitigate it?

Edit: just wanted to add: I'm really impressed by your work and I'm looking forward to trying this out. I don't mean to be mean, these are genuine questions I have. Congratulations on the launch.


> Do you have a clear position on which PostgreSQL features not to support? I suppose there are more than just some things that won't make the cut because of the architectural decisions.

While I unnderstand the decision, I'm not sure it's the best way to go about it. If you only emulate a subset of PostgreSQL's syntax and features, few people will be compelled to switch because they might be afraid.

Eventually, we'd like to support the entirety of PostgreSQL's feature set, even including features like extensions. Dolt (https://github.com/dolthub/dolt), our first product, is the same to MySQL and DoltgreSQL is to Postgres, and we're taking a no-compromises approach to what we support. That, of course, means that there are a lot of features that need to be implemented, but Dolt is already almost there. For the majority of customers, Dolt has implemented everything they need from MySQL.

I'd definitely recommended checking out how Dolt compares with MySQL to see how we're approaching compatibility. All behavior, implicit and explicit, is something that we aim to model, and any deviations are considered bugs that we need to fix. There are exceptions, but those are only used when we feel it's for good reason (an example being how MySQL handles collation cascading in some circumstances).

> Have you guys already encountered some things in the PostgreSQL engine that just behave a bit differently from Dolt's engine? If so, what was your approach to mitigate it?

With DoltgreSQL, it's at an extremely early stage. We're still working on getting the basic functionality working before we rigorously start testing to make sure that we match PostgreSQL's behavior. However, we can point to our approach with Dolt and MySQL for how we plan to handle DoltgreSQL and PostgreSQL. For every feature we implement, we compare the functionality with what is written in MySQL's documentation as a baseline. From there, we move on to comparing the output across a range of input statements. Sometimes the documentation differs from MySQL's own results, and we then try to find out why that's the case (Configuration? Out of date documentation? Bug? etc.).

We also use external benchmarks to measure our correctness versus MySQL. In one such benchmark, containing around 6 million tests, Dolt recently reached 99.99% compared to MySQL (https://www.dolthub.com/blog/2023-10-11-four-9s-correctness/).

I hope this answered your questions! Let me know if you have any more :)


That does answer my questions. It's an extremely ambitious undertaking and I wish you the best. I'll be following this closely.

Some people do performance optimisations based on PostgreSQL's inner working (e.g. trying to force data that isn't read often but not small into toast). How far do your ambitions go? Are you planning on modeling internal behavior like this as well? Do you think putting in an abstraction layer like this between Dolt will hurt performance?

Do you have a time frame (probably not) or a roadmap for Doltgres?


Thank you!

Performance optimizations are tackled a bit differently than correctness ones. For the most part, we'll try to use metrics to find weak points in the execution graph and optimize those, but we won't go so far as to try and model the internal performance behavior. In part because our storage format is so different that we'll have different performance characteristics by necessity.

We don't have a public roadmap for Doltgres just yet, but we're hoping to put one out quite soon! We have a lot of low-hanging roadblocks that we want to take care of before we can get a better look at the overall time frame. We should definitely have one up by the end of the month, but I don't want to commit to a time before that.


Can people who use Dolt explain their use case? Dolt competes with Flyway and Bytebase, but it requires you to run their forked database? Schema migration is unpleasant and not something I can imagine doing willy-nilly like a git commit.


We don't really compete with Flyway and Bytebase. Schema migrations are but one aspect of a versioned database. We version everything, from the schema to the data. You can read more here:

https://www.dolthub.com/blog/2022-08-04-database-versioning/

A lot of products have come out that attempt to tackle schema versioning, but none have tackled data versioning before Dolt (https://github.com/dolthub/dolt). In addition, our database isn't forked, it's a full, bespoke solution that can operate as a drop-in replacement for MySQL (Dolt) or PostgreSQL (DoltgreSQL). It's honestly quite exciting technology, so definitely feel free to ask any more questions if you're curious to learn more!

Here is a link to a few use cases as well: https://www.dolthub.com/blog/2022-07-11-dolt-case-studies/


One of Bytebase authors here. I learned Dolt a while back (memorable name).

I think Dolt is closer to Neon/Xata. But still there are differences.

IIUC, Dolt is bringing the database feature to Git, while Neon/Xata is bringing the Git feature to database.

Speaking of Bytebase, if Dolt is really good at versioning schema migration, Bytebase value proposition will be a bit less attractive, but not much. It's similar to the Git story, regardless how powerful Git is, people still need GitLab/GitHub for the developer workflow on top of the mere versioning.


I just learned of Neon a few weeks ago. From the looks of it, Neon supports branching, but it doesn't support merging. Xata supports both branching and merging, however it only applies to the schema.

Dolt (and eventually DoltgreSQL) handles everything. Branching, merging, diffing, cherry-picking, commits, and more. Working with both the schema and data. On top of that, we also have DoltHub (https://www.dolthub.com/) that's analogous to GitHub, and DoltLab (https://www.dolthub.com/#doltlab) that's analogous to GitLab. We are targeting the entire ecosystem from the bottom up.


While version control may be useful, many people already have a managed database, ruling out your solution. If only Dolt could be an extension. If I ever need to run my own MLOps Metadata database, I will consider Dolt though:

https://docs.dolthub.com/introduction/use-cases/data-and-mod...


I was once thinking about where could I use this tool, but nothing really came to my mind. Any interesting use cases or stories to share?


We've got a blog about a few use cases that you might find interesting.

https://www.dolthub.com/blog/2022-07-11-dolt-case-studies/


This seems so obviously useful and surprising it's taken industry so long to get here. But honestly I'm hesitant to go for something that isn't like a wrapper around the underlying database, just because those are notoriously difficult to implement correctly. Postgres has been around a long time, and has been battle tested, etc.


Yes, it's definitely an uphill battle to launch a new DB for this reason. Our current customers are people who need version control bad enough they're willing to take a chance on a younger implementation.

That said, we have 99.99% correctness according to sqllogictest:

https://dolthub.awsdev.ld-corp.com/blog/2023-10-11-four-9s-c...


I guess, you can query your data's history like how Wikipedia did. And you could revert it.


Do you know VMDS by GE (part of Smallworld)?

It has data versioning too, using a concept called ‘alternatives’. Do you know where DoltgreSQL and VMDS differ/overlap, functionally?

See https://en.m.wikipedia.org/wiki/VMDS


This is interesting, I've never heard of VMDS before. Someone else on the team may have, but I have not personally. From a quick glance at the linked Wikipedia page, it looks like this was created around the same time as SQL, and therefore has a different interaction model. It also predates a lot of modern version-control software like Git, SVN, and Perforce.

So it looks like VMDS is a rather unique take on versioning data, where data is viewed as objects. Dolt and DoltgreSQL use the table and row paradigms that the majority of modern relational databases use. Dolt is designed to be a drop-in replacement for MySQL, and DoltgreSQL for PostgreSQL, so that already determines the interface. Regarding VMDS' versioning functionality, I see that is supports merging and conflict resolution at the transaction layer, but I'm not seeing anything similar to the concept of branches, which is something that Dolt and DoltgreSQL supports.

Overall, for modern use cases, it doesn't look like they overlap much. VMDS seems focused more on the spatial case while representing data as objects, while Dolt and DoltgreSQL are traditional relational databases that support versioning all aspects of a relational database.


How is this better than existing migration solutions? More specifically, what does this give me that I'm not getting using the migration provided by an ORM?

Current migration implementations live on the software side and are essentially a folder full of sequential, date stamped, SQL commands to execute. These files are checked into git so they are versioned.

No offense to you or your team or this project, but I obviously trust the Postgres core team far more than you. However, everything is about trade-offs. So, what (briefly) makes this a risk worth taking?


From what I understand it's less about Schema history or migrations, then it is about data history. So if your data was stored in a VCS but it was also a Db. So you could query the database as it was last Thursday, just like you might look at an old version of a file in Git


Interesting. I don't think I've ever really needed that capability, but it is a pretty cool concept.

I HAVE needed an old version of a database to check/compare something, but nightly backups have always sufficed, and for disaster recovery I typically have point-in-time recovery set up. So if a migration goes boom I can just say, "ummmm lets go back to a couple minutes ago" which is a nice safety net.


hahn-kev made a really good comparison, but it can go a step further. Not only can you query the data at some previous point in time, but you can even run a join across old and current data. If you look at it from the perspective of commits and working sets, then the working set is just the "current" commit, and commands implicitly target the current commit. We give you explicit control over which commit is being referenced, so there's a lot of power in the model.

Recovery is a valid use-case for version control, but fully applying it to data opens up possibilities that haven't really been explored before.


Really exciting, I was interested in Dolt but I prefer postgres.

I'm curious if you have any plans for local first development. I love the idea of a version controlled database, but I want to use it for local first and distributed data syncing for users who are frequently without internet. It's not clear how you would do this without running the database locally. Do you have thoughts on this? DoltLite?


So you can do local first with the `dolt_clone()`, `dolt_remote()`,`dolt_fetch()`, `dolt_pull()`, and `dolt_push()` procedures:

https://docs.dolthub.com/sql-reference/version-control/dolt-...

as well as the `dolt_remotes` system table:

https://docs.dolthub.com/sql-reference/version-control/dolt-...

We have a ton of remote options:

https://docs.dolthub.com/sql-reference/version-control/remot...

Right now, DoltHub (https://www.dolthub.com/) will still work as a remote but all the SQL there will be the MySQL version. Over time the Doltgres storage format may become more bespoke and we'll have to figure out how DoltHub will work.


You can already use Dolt in embedded mode. Check this blog post from them: https://www.dolthub.com/blog/2022-07-25-embedded/. I use it like that and it works great.


Hmmm... I've been trying to extend a website I have with essentially wiki functionality.

But Dolt is 1.7x slower than MySQL, as is written in that Readme.

I'm looking for ways to have effortless and traceable changes to an entry or entries.

I've experimented with ArangoDB but it wasn't for me. Dolt is now the only other VCS for databases I'm aware of.

Are there any alternatives? So I could try more and decide which to settle with.


Dolt is the only version controlled SQL database.

1.7x MySQL is still very fast. You're talking about .5ms for a point lookup instead of .3ms, it's not something users will notice at typical scales. And we're still iterating on performance, the gap will continue to close over time.


Congrats - love this, especially joining across old and current data (or any point in time if i understood well?)

I recently stumbled upon the concept of "bitemporal modeling" (ie. rewinding data "as of") - thought it described well this use case.


You understood correctly! That's just one use case, there are many, many more. Some of which we haven't even considered yet, just because the model is that powerful.


How is the replication / master-master / scale-out for this database? We really would love a solution like this but it would need to scale for our use case.


At the moment, it does not exist for DoltgreSQL as it is in pre-alpha. Dolt (https://github.com/dolthub/dolt), however, may be a better fit for your use case.

https://docs.dolthub.com/concepts/dolt/rdbms/replication


Ah, I missed that! Thanks for pointing that out.


Great work. I will use this for the pet project I'm working on. Postgres rules. Thanks.


Just want to point out that we're announcing development on the project. It's absolutely not ready for mainstream use yet! We have Dolt (https://github.com/dolthub/dolt) which is production-ready and widely in use, but it uses MySQL's syntax and wire protocol. We are building the Dolt equivalent for PostgreSQL, which is DoltgreSQL, but it's only pre-alpha.


Question: what is the difference with database migration tools like alembic ?


This is a fully-versioned database. By versioning, think of how Git applies to text files, and all of the power that comes from that model. You have branching, diffing, merging, etc. It allows you to collaborate with each other, or you can create a brand new project by "forking" an existing one. Now, imagine that model, but applied to a relational database.

There are products that can do something similar for schemas, but DoltgreSQL (and Dolt, the production-ready DB based around MySQL rather than PostgreSQL) applies this to both schemas and data. This is unique to DoltgreSQL and Dolt, and what sets us apart from all other products. One example I like to use that I think really captures this power, is the ability to query data from any commit. We use Git's model of version control, so your database has a commit history and a working set. The working set can be viewed as the "current" commit. When you run queries, you implicitly select the working set, however we expose the ability to explicitly set which commit you're targeting. This means you can do things like run a join query across old and current data!

These situations, and more, and not possible via migration tools, backup tools, or any other kind of tools. This is unique to DoltgreSQL and Dolt, by fully embracing data versioning. This is but one example, and there are many, many more.


OK, i understand


Nice job guys! -Alec


Thank you! :D


No offence, but how to read the name?


The more I read on, the more was convinced I was reading an April fools article or a Tanenbaum textbook. Turns out Dolthub is hosted by Dolt and Doltlab is self hosted.


In person, we generally say Doltgres, similar to Postgres. For the full name, it's just going to be Doltgres-Q-L.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: