
While the technical terms may be unfamiliar, it's all pretty straightforward and requires between 1 and 4 kloc, depending on how fast and fancy you want things to be. (This includes everything from the basic copy-on-write data structures to the query language.)

Building an immutable path-compressed radix tree is pretty straightforward and requires around 1-2 kloc, and it's easy to keep track of the n smallest hashes of leaf values, as well as the total count of leaves in each node. The sampling done by the min-hashes gives you a good indication of how much two nodes overlap, which is all the Jaccard index measures.
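To make the min-hash sampling concrete, here's a toy sketch in Python of a bottom-k sketch and the Jaccard estimate it yields. The function names and the choice of SHA-1 are illustrative assumptions, not anything from the comment above; a real tree would maintain these sketches incrementally per node.

```python
import hashlib

def bottom_k(items, k=256):
    """Keep the k smallest hashes of a set's elements (a bottom-k sketch)."""
    hs = sorted(int(hashlib.sha1(str(x).encode()).hexdigest(), 16) for x in set(items))
    return set(hs[:k])

def jaccard_estimate(sk_a, sk_b, k=256):
    """Estimate |A n B| / |A u B| from two bottom-k sketches alone."""
    union_k = set(sorted(sk_a | sk_b)[:k])   # bottom-k of the union
    return len(union_k & sk_a & sk_b) / len(union_k)

a = range(0, 1000)
b = range(500, 1500)   # true Jaccard = 500/1500, about 0.33
est = jaccard_estimate(bottom_k(a), bottom_k(b))
```

The estimate lands near the true overlap with error on the order of 1/sqrt(k), which is plenty for steering a join.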

The query engine itself is like 0.5 kloc, and just walks the different radix-trie indices simultaneously. The basic insight of worst-case optimal (WCO) joins is that it's a lot cheaper to join everything at once and treat it as a constraint propagation problem.
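A minimal sketch of that "join everything at once" idea, for the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c). This is my own toy illustration, not the commenter's engine: it binds one variable at a time and intersects the candidate sets that every relevant relation offers, which is the constraint-propagation view of a WCO join.

```python
from collections import defaultdict

def index(rel):
    """Trie-like index: first attribute -> set of second attributes."""
    idx = defaultdict(set)
    for x, y in rel:
        idx[x].add(y)
    return idx

def triangles(R, S, T):
    """Enumerate (a, b, c) with R(a,b), S(b,c), T(a,c) by intersecting
    candidate sets one variable at a time."""
    Ri, Si, Ti = index(R), index(S), index(T)
    out = []
    for a in Ri.keys() & Ti.keys():       # a must appear in both R and T
        for b in Ri[a] & Si.keys():       # b constrained by R(a,.) and S
            for c in Si[b] & Ti[a]:       # c constrained by S(b,.) and T(a,.)
                out.append((a, b, c))
    return out

found = triangles({(1, 2), (1, 3)}, {(2, 3), (3, 1)}, {(1, 3)})
```

No intermediate relation is ever materialized; a variable with an empty candidate set prunes the whole subtree immediately.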

A LINQ-style query parser takes up another 1-2 kloc and is just a bunch of plain ol' boilerplate.

In total that's about as much code as your average large C++ codebase CMake file.

You could sketch the entire thing on a napkin and build it in a week if you've built something like it before.



"If you build something like that before". I think I can add that to the list of famous last words. ;)

Please excuse my sarcasm again. But tell me, what do I do if I haven't built something like this before? What if I built something like an equality saturation engine, pseudo-Boolean optimization using a SAT solver, and/or beam search? Will that help me somehow?

You estimated the code size at 4.5 kloc max. Given that a C++ programmer delivers 20-25 debugged lines of code per hour in the long run (IBM's stats), that's 225 hours of work. Given that PSP/TSP recommends planning for 4 hours-on-task per day, it will take 56 work days. My calculation suggests 2.5 work months to implement all that in the worst case of 4.5 kloc. Even the best case of 2.5 kloc would take a month and a half of work.
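Spelling the arithmetic out, using the figures cited above (the productivity and hours-per-day numbers are the comment's own assumptions, not mine):

```python
loc_per_hour = 20    # IBM's long-run figure for debugged C++ LOC/hour
hours_per_day = 4    # PSP/TSP recommended hours-on-task per day

def work_days(kloc):
    """Rough schedule estimate in work days for a given code size."""
    return kloc * 1000 / loc_per_hour / hours_per_day

worst = work_days(4.5)   # about 56 days, roughly 2.5 work months
best = work_days(2.5)    # about 31 days, roughly a month and a half
```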

Your proposition is not a week's project. Not at all; you can't squeeze it that much.

Query planning is hard. Even if you try your best to avoid it.

And we have not even started talking about WHERE, GROUP BY and ORDER BY clauses' optimizations.

[1] shows the use of loop nest optimization combined with beam search. SQLite translates joins into loop nests and transforms those nests into a graph, where a path through the graph represents a solution. [1] shows a simplified nesting graph that is linear; in general it will be quadratic in the number of tables.

[1] https://www.sqlite.org/queryplanner-ng.html

I really like that approach. It is exactly equality saturation [2] (saturate loop nestings through loop nesting commutativity) with beam search as the selection phase.

[2] https://rosstate.org/publications/eqsat/

I think that equality saturation with beam search is a week-long project if you have already built something like that. The difference? It will work for arbitrary joins.

Of course I am making fun of your statements.

Equality saturation requires quite careful planning and will not work for a couple of months; you will keep finding something that fails. Beam search is just as hard. You can encode the optimal-solution-selection problem as a pseudo-Boolean optimization problem, which is more straightforward than beam search, and [2] shows that Pueblo is no slouch there.

This is not to say that query planning is easy when you do it my way. Query planning is hard. NP-hard, actually. Sometimes you can get away with a simpler, less general solution, but it will bite you sooner than you expect.


C++ is probably the slowest programming language there is in terms of development speed, but even if it takes you 3 months to build something like this from scratch, it'll still be an order of magnitude less time than building a clone of another SQL database.

`WHERE, GROUP BY and ORDER BY` are relatively straightforward to tack onto the variable ordering.
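One way to read "tack onto the variable ordering": apply each WHERE predicate the moment its variable is bound, and get ORDER BY by choosing the variable order and enumerating candidates in sorted order. A toy sketch under my own simplifying assumption that each variable has an independent candidate set (a real engine would intersect candidates per partial binding):

```python
# Candidate values per variable, as if discovered by walking the indices
candidates = {"a": {1, 2, 3, 4}, "b": {2, 3}, "c": {3, 5}}

# WHERE predicates attached to the earliest variable that decides them
filters = {"a": lambda a: a % 2 == 0}      # e.g. WHERE a % 2 = 0

def enumerate_bindings(order=("a", "b", "c")):
    """Extend partial bindings one variable at a time, filtering eagerly.
    Sorted candidate enumeration makes the output ordered by `order`."""
    results = [{}]
    for var in order:
        keep = filters.get(var, lambda _: True)
        results = [dict(r, **{var: v})
                   for r in results
                   for v in sorted(candidates[var])
                   if keep(v)]
    return results

rows = enumerate_bindings()
```

Filtering at bind time prunes whole subtrees instead of post-filtering a cross product; GROUP BY falls out similarly by bucketing on the grouped variables' prefix.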

Having written a SAT solver would kinda help you, because modern SAT solving algorithms are fundamentally very similar to worst-case optimal join algorithms.

The query plan produced by combining binary joins is always going to be off by up to an exponential factor compared to a WCO join. If you're fine throwing all that complexity at your problem to generate sub-par query plans, be my guest.


And how are modern SAT solving algorithms fundamentally similar to worst-case optimal join algorithms?

There are at least two different families of them: conflict-driven clause learning (CDCL) and its variants, and stochastic search and its variants.

Which one is more similar to a WCO join algorithm?



The paper you provided has two important omissions: 1) it does not mention the Tseitin transformation for encoding a Disjunctive Normal Form into a Conjunctive Normal Form, and 2) it does not look at decision diagrams, especially zero-suppressed decision diagrams (ZDDs) for set representation.

The Tseitin transformation prevents exponential blowup in the conversion from DNF to CNF.
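A small sketch of why: introduce one fresh selector variable per DNF term, so the CNF grows linearly in the size of the input rather than as the product of term widths. The encoding below (literals as signed integers, DIMACS-style) is my own illustration of the standard construction:

```python
def tseitin_dnf(dnf):
    """Encode a DNF (list of terms, each a list of signed int literals)
    into an equisatisfiable CNF with one fresh variable per term."""
    max_var = max(abs(lit) for term in dnf for lit in term)
    clauses, selectors = [], []
    for i, term in enumerate(dnf):
        y = max_var + 1 + i                 # fresh selector for this term
        selectors.append(y)
        for lit in term:
            clauses.append([-y, lit])       # y implies each literal of the term
        clauses.append([-lit for lit in term] + [y])  # term implies y
    clauses.append(selectors)               # at least one term must hold
    return clauses

# (x1 & x2) | (x3 & x4) with variables 1..4
cnf = tseitin_dnf([[1, 2], [3, 4]])
```

The naive distributive conversion of a DNF with t terms of width w produces w^t clauses; Tseitin produces O(t * w) clauses plus t auxiliary variables.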

The fact that the paper's algorithm employs disjunctive normal form suggests the use of binary decision diagrams. ROBDDs represent DNFs naturally, for example, and ZDDs represent sets of sets and were used relatively successfully in non-trivial approaches to SAT solving like [1].

[1] https://web.eecs.umich.edu/~imarkov/pubs/book/b002.pdf

Also, the paper you mentioned has this right in the abstract: "However, there is still the quest of making the new worst-case optimal join algorithms truly practical in terms of (1) ease of implementation and (2) secondary index efficiency in terms of number of indexes created to answer a query."

WCO joins are hard to implement and might be computationally prohibitive.

Yet, it's quite interesting, thank you very much.


>The query plan produced by combining binary joins is always going to be off by up to an exponential factor when compared to a WCO join.

Citation greatly needed. Why is it so?


The intuition is that if your join is a triangle, you might be forced to join every relation only to discover that the result is empty on the last join, which can require m^n intermediate results, where m is the size of the relations and n is the number of relations. A WCO join algorithm will figure out quite quickly that the output is empty; in practice its cost depends much more on the actual output size than on the size of the input relations.
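A tiny instance of that intuition, constructed by me for illustration: three relations whose pairwise joins are large but whose triangle is empty. A binary plan materializes a thousand intermediate tuples before finding out; intersecting the candidate sets for one variable up front reveals emptiness immediately.

```python
from itertools import product

R = set(product(range(10), range(10)))        # R(a, b)
S = set(product(range(10), range(10, 20)))    # S(b, c)
T = set(product(range(20, 30), range(10)))    # T(c, a): no c overlaps S!

# Binary plan: (R join S) join T materializes every (a, b, c) first
RS = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]
result_binary = [t for t in RS if (t[2], t[0]) in T]

# WCO-style: intersect the candidate sets for c before enumerating anything
S_c = {c for (_, c) in S}
T_c = {c for (c, _) in T}
c_candidates = S_c & T_c                      # empty, so the answer is empty
```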

See:

https://www.cs.stanford.edu/people/chrismre/papers/paper49.N...

https://justinjaffray.com/a-gentle-ish-introduction-to-worst...


Thank you, the second of your links starts with an explanation of the expected benefits, which is great.

The difference is not exponential, if I may nitpick. It is a sublinear factor in the case of the "triangles example": WCO produces O(n^(1.5)), a binary join produces O(n^2), so the difference is O(n^(0.5)), or O(sqrt(n)). The difference is big, but not exponential.
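The gap is easy to see on the classic skewed instance (my own construction of the standard example, where each relation is a "star" through one hub value): the binary plan's intermediate result grows as Theta(n^2) while the triangle output grows only as Theta(n), consistent with the O(n^1.5) worst-case bound for WCO.

```python
n = 100
star = {(0, i) for i in range(n)} | {(i, 0) for i in range(n)}
R = S = T = star                              # three copies of the skewed relation

# Binary plan: R join S blows up to ~n^2 intermediate tuples...
RS = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]
# ...while the final triangle output is only ~3n tuples
triangles = [t for t in RS if (t[2], t[0]) in T]
```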


That blogpost is a godsend: it provides a great explanation and intuition, and resolved a few questions I was left with after reading a lot of WCO join papers.

Good nitpick. I think you can construct larger rings where the difference grows further, but I might be wrong. The dynamic variable ordering trick makes a huge difference in practice, especially when skew is in play. (https://arxiv.org/pdf/1310.3314.pdf)

In general you're right: WCO joins are a relatively young field of study and sometimes struggle with large constant factors, but they are maturing quickly, and in a limited (triple) setting like the one OP "wished for" they are a lot more feasible than in the general case.

Thanks for the interesting discussions, looking forward to reading the references you provided in depth!

Edit: I just remembered this paper, which might be of interest to you. They seem to recover WCO bounds in a pairwise setting, by choosing very smart intermediary join representations: http://www.cs.ox.ac.uk/dan.olteanu/papers/co-tr16.pdf


Exactly what I needed, thank you again very much!

My old idea was to perform planning again after some of the work has been done, because the statistics of the remaining work can be different. These papers are of great help!





