
While the technical terms may be unfamiliar, it's all pretty straightforward and requires between 1 and 4 kloc, depending on how fast and fancy you want things to be. (This includes everything from the basic copy-on-write data structures to the query language.)

Building an immutable path-compressed radix tree is pretty straightforward and requires around 1-2 kloc, and it's easy to keep track of the n smallest hashes of leaf values, as well as the total count of leaves in each node. The sampling done by the min-hashes gives you a good indication of how much two nodes overlap, which is all the Jaccard index measures.
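To make the min-hash sampling concrete, here's a toy sketch in Python of a bottom-k sketch and the Jaccard estimate it yields. The function names and the choice of SHA-1 are illustrative assumptions, not anything from the comment above; a real tree would maintain these sketches incrementally per node.

```python
import hashlib

def bottom_k(items, k=256):
    """Keep the k smallest hashes of a set's elements (a bottom-k sketch)."""
    hs = sorted(int(hashlib.sha1(str(x).encode()).hexdigest(), 16) for x in set(items))
    return set(hs[:k])

def jaccard_estimate(sk_a, sk_b, k=256):
    """Estimate |A n B| / |A u B| from two bottom-k sketches alone."""
    union_k = set(sorted(sk_a | sk_b)[:k])   # bottom-k of the union
    return len(union_k & sk_a & sk_b) / len(union_k)

a = range(0, 1000)
b = range(500, 1500)   # true Jaccard = 500/1500, about 0.33
est = jaccard_estimate(bottom_k(a), bottom_k(b))
```

The estimate lands near the true overlap with error on the order of 1/sqrt(k), which is plenty for steering a join.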

The query engine itself is like 0.5 kloc, and just walks the different radix-trie indices simultaneously. The basic insight of worst-case optimal (WCO) joins is that it's a lot cheaper to join everything at once and treat it as a constraint propagation problem.
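A minimal sketch of that "join everything at once" idea, for the triangle query Q(a,b,c) = R(a,b), S(b,c), T(a,c). This is my own toy illustration, not the commenter's engine: it binds one variable at a time and intersects the candidate sets that every relevant relation offers, which is the constraint-propagation view of a WCO join.

```python
from collections import defaultdict

def index(rel):
    """Trie-like index: first attribute -> set of second attributes."""
    idx = defaultdict(set)
    for x, y in rel:
        idx[x].add(y)
    return idx

def triangles(R, S, T):
    """Enumerate (a, b, c) with R(a,b), S(b,c), T(a,c) by intersecting
    candidate sets one variable at a time."""
    Ri, Si, Ti = index(R), index(S), index(T)
    out = []
    for a in Ri.keys() & Ti.keys():       # a must appear in both R and T
        for b in Ri[a] & Si.keys():       # b constrained by R(a,.) and S
            for c in Si[b] & Ti[a]:       # c constrained by S(b,.) and T(a,.)
                out.append((a, b, c))
    return out

found = triangles({(1, 2), (1, 3)}, {(2, 3), (3, 1)}, {(1, 3)})
```

No intermediate relation is ever materialized; a variable with an empty candidate set prunes the whole subtree immediately.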

A LINQ-style query parser takes up another 1-2 kloc and is just a bunch of plain ol' boilerplate.

In total that's about as much code as your average large C++ codebase CMake file.

You could sketch the entire thing on a napkin and build it in a week if you've built something like it before.



"If you build something like that before". I think I can add that to the list of famous last words. ;)

Please excuse my sarcasm again. But tell me, what do I do if I haven't built something like this before? What if I built something like an equality saturation engine, pseudo-Boolean optimization using a SAT solver, and/or beam search? Will that help me somehow?

You estimated the code size at 4.5 kloc max. Given that a C++ programmer delivers 20-25 debugged lines of code per hour in the long run (IBM's stats), that's 225 hours of work. Given that PSP/TSP recommends planning for 4 hours-on-task per day, it will take 56 work days. My calculation suggests 2.5 work months to implement all that in the worst case of 4.5 kloc. Even the best case of 2.5 kloc would take a month and a half of work.
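Spelling the arithmetic out, using the figures cited above (the productivity and hours-per-day numbers are the comment's own assumptions, not mine):

```python
loc_per_hour = 20    # IBM's long-run figure for debugged C++ LOC/hour
hours_per_day = 4    # PSP/TSP recommended hours-on-task per day

def work_days(kloc):
    """Rough schedule estimate in work days for a given code size."""
    return kloc * 1000 / loc_per_hour / hours_per_day

worst = work_days(4.5)   # about 56 days, roughly 2.5 work months
best = work_days(2.5)    # about 31 days, roughly a month and a half
```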

Your proposition is not a week's project. Not at all; you can't squeeze it that much.

Query planning is hard. Even if you try your best to avoid it.

And we have not even started talking about WHERE, GROUP BY and ORDER BY clauses' optimizations.

[1] shows the use of loop nest optimization combined with beam search. SQLite translates joins into loop nests and transforms those nests into a graph, where a path through the graph represents a solution. [1] shows a simplified nesting graph that is linear; in general it will be quadratic in the number of tables.

[1] https://www.sqlite.org/queryplanner-ng.html

I really like that approach. It is exactly equality saturation [2] (saturate loop nestings through loop nesting commutativity) with beam search as the selection phase.

[2] https://rosstate.org/publications/eqsat/

I think that equality saturation with beam search is a week-long project if you have already built something like that. The difference? It will work for arbitrary joins.

Of course I am making fun of your statements.

Equality saturation requires quite careful planning and will not work for a couple of months; you will keep finding something that fails. Beam search is just as hard. You can encode the optimal-solution-selection problem as a pseudo-Boolean optimization problem, which is more straightforward than beam search, and [2] shows that Pueblo is no slouch there.

This is not to say that query planning is easy when you do it my way. Query planning is hard. NP-hard, actually. Sometimes you can get away with a simpler, less general solution, but it will bite you sooner than you expect.


C++ is probably the slowest programming language there is in terms of development speed, but even if it takes you 3 months to build something like this from scratch, it'll still be an order of magnitude less time than building a clone of another SQL database.

`WHERE, GROUP BY and ORDER BY` are relatively straightforward to tack onto the variable ordering.
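One way to read "tack onto the variable ordering": apply each WHERE predicate the moment its variable is bound, and get ORDER BY by choosing the variable order and enumerating candidates in sorted order. A toy sketch under my own simplifying assumption that each variable has an independent candidate set (a real engine would intersect candidates per partial binding):

```python
# Candidate values per variable, as if discovered by walking the indices
candidates = {"a": {1, 2, 3, 4}, "b": {2, 3}, "c": {3, 5}}

# WHERE predicates attached to the earliest variable that decides them
filters = {"a": lambda a: a % 2 == 0}      # e.g. WHERE a % 2 = 0

def enumerate_bindings(order=("a", "b", "c")):
    """Extend partial bindings one variable at a time, filtering eagerly.
    Sorted candidate enumeration makes the output ordered by `order`."""
    results = [{}]
    for var in order:
        keep = filters.get(var, lambda _: True)
        results = [dict(r, **{var: v})
                   for r in results
                   for v in sorted(candidates[var])
                   if keep(v)]
    return results

rows = enumerate_bindings()
```

Filtering at bind time prunes whole subtrees instead of post-filtering a cross product; GROUP BY falls out similarly by bucketing on the grouped variables' prefix.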

Having written a SAT solver would kinda help you, because modern SAT solving algorithms are fundamentally very similar to worst-case optimal join algorithms.

The query plan produced by combining binary joins is always going to be off by up to an exponential factor compared to a WCO join. If you're fine throwing all that complexity at your problem to generate sub-par query plans, be my guest.


And how are modern SAT solving algorithms fundamentally similar to worst-case optimal join algorithms?

There are at least two different families of them: conflict-driven clause learning (CDCL) and its variants, and stochastic search and its variants.

Which one is more similar to a WCO join algorithm?



The paper you provided has two important omissions: 1) it does not mention the Tseitin transformation for encoding a Disjunctive Normal Form into a Conjunctive Normal Form, and 2) it does not look at decision diagrams, especially zero-suppressed decision diagrams (ZDDs) for set representation.

The Tseitin transformation prevents exponential blowup in the conversion from DNF to CNF.
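A small sketch of why: introduce one fresh selector variable per DNF term, so the CNF grows linearly in the size of the input rather than as the product of term widths. The encoding below (literals as signed integers, DIMACS-style) is my own illustration of the standard construction:

```python
def tseitin_dnf(dnf):
    """Encode a DNF (list of terms, each a list of signed int literals)
    into an equisatisfiable CNF with one fresh variable per term."""
    max_var = max(abs(lit) for term in dnf for lit in term)
    clauses, selectors = [], []
    for i, term in enumerate(dnf):
        y = max_var + 1 + i                 # fresh selector for this term
        selectors.append(y)
        for lit in term:
            clauses.append([-y, lit])       # y implies each literal of the term
        clauses.append([-lit for lit in term] + [y])  # term implies y
    clauses.append(selectors)               # at least one term must hold
    return clauses

# (x1 & x2) | (x3 & x4) with variables 1..4
cnf = tseitin_dnf([[1, 2], [3, 4]])
```

The naive distributive conversion of a DNF with t terms of width w produces w^t clauses; Tseitin produces O(t * w) clauses plus t auxiliary variables.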

The fact that the paper's algorithm employs disjunctive normal form suggests the use of binary decision diagrams. ROBDDs represent DNFs naturally, for example, and ZDDs represent sets of sets and were used relatively successfully in non-trivial approaches to SAT solving like [1].

[1] https://web.eecs.umich.edu/~imarkov/pubs/book/b002.pdf

Also, the paper you mentioned has this right in the abstract: "However, there is still the quest of making the new worst-case optimal join algorithms truly practical in terms of (1) ease of implementation and (2) secondary index efficiency in terms of number of indexes created to answer a query."

WCO joins are hard to implement and might be computationally prohibitive.

Yet, it's quite interesting, thank you very much.


>The query plan produced by combining binary joins is always going to be off by up to an exponential factor when compared to a WCO join.

Citation greatly needed. Why is it so?


The intuition is that if your join is a triangle, you might be forced to join every relation only to discover that the result is empty on the last join, which can require m^n intermediate results, where m is the size of the relations and n is the number of relations. A WCO join algorithm will figure out quite quickly that the output is empty; in practice its cost depends much more on the actual output size than on the size of the input relations.
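A tiny instance of that intuition, constructed by me for illustration: three relations whose pairwise joins are large but whose triangle is empty. A binary plan materializes a thousand intermediate tuples before finding out; intersecting the candidate sets for one variable up front reveals emptiness immediately.

```python
from itertools import product

R = set(product(range(10), range(10)))        # R(a, b)
S = set(product(range(10), range(10, 20)))    # S(b, c)
T = set(product(range(20, 30), range(10)))    # T(c, a): no c overlaps S!

# Binary plan: (R join S) join T materializes every (a, b, c) first
RS = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]
result_binary = [t for t in RS if (t[2], t[0]) in T]

# WCO-style: intersect the candidate sets for c before enumerating anything
S_c = {c for (_, c) in S}
T_c = {c for (c, _) in T}
c_candidates = S_c & T_c                      # empty, so the answer is empty
```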

See:

https://www.cs.stanford.edu/people/chrismre/papers/paper49.N...

https://justinjaffray.com/a-gentle-ish-introduction-to-worst...


Thank you, the second of your links starts with an explanation of the expected benefits, which is great.

The difference is not exponential, if I may nitpick. It is a sublinear factor in the case of the "triangles example": WCO produces O(n^(1.5)), a binary join produces O(n^2), so the difference is O(n^(0.5)), or O(sqrt(n)). The difference is big, but not exponential.
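The gap is easy to see on the classic skewed instance (my own construction of the standard example, where each relation is a "star" through one hub value): the binary plan's intermediate result grows as Theta(n^2) while the triangle output grows only as Theta(n), consistent with the O(n^1.5) worst-case bound for WCO.

```python
n = 100
star = {(0, i) for i in range(n)} | {(i, 0) for i in range(n)}
R = S = T = star                              # three copies of the skewed relation

# Binary plan: R join S blows up to ~n^2 intermediate tuples...
RS = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]
# ...while the final triangle output is only ~3n tuples
triangles = [t for t in RS if (t[2], t[0]) in T]
```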


That blogpost is a godsend: it provides a great explanation and intuition, and resolved a few questions I was left with after reading a lot of WCO join papers.

Good nitpick. I think you can construct larger rings where the difference grows further, but I might be wrong. The dynamic variable ordering trick makes a huge difference in practice, especially when skew is in play. (https://arxiv.org/pdf/1310.3314.pdf)

In general you're right: WCO joins are a relatively young field of study and sometimes struggle with large constant factors, but they are maturing quickly, and in a limited (triple) setting like the one OP "wished for" they are a lot more feasible than in the general case.

Thanks for the interesting discussions, looking forward to reading the references you provided in depth!

Edit: I just remembered this paper, which might be of interest to you. They seem to recover WCO bounds in a pairwise setting, by choosing very smart intermediary join representations: http://www.cs.ox.ac.uk/dan.olteanu/papers/co-tr16.pdf


Exactly what I needed, thank you again very much!

My old idea was to perform planning again after some of the work has been done, because the statistics of the remaining work can be different. These papers are of great help!





