Sorry, I know the article is kind of sensational but it also has some good information and we're here to discuss the real substance in the thread.
Terark built a new storage engine for Database and Data Systems based on the Succinct Nested Trie data structure. Our technology enables direct search on highly compressed data without decompressing it. Thanks to that we obtain >200X faster performance and more than 15X storage savings (better than Google's LevelDB or Facebook's RocksDB).
We are a Y Combinator company (W17).
We also provide a free license of TerarkDB and you can download the exact scripts we used and run your own benchmarks with the configuration you want. We know the claim may sound outlandish, so we try to be as transparent as possible.
Can you please do some benchmarks against MySQL and PostgreSQL? The vast majority of your prospective customers will be using these two instead of RocksDB or MongoDB.
Sure! Our benchmark against MySQL is here: https://github.com/Terark/mysql-on-terarkdb/wiki/YCSB-on-9.1...
We used YCSB on 9.1 GB of movie data.
This benchmark is comparing MySQL with our product "MySQL on Terark". "MySQL on Terark" is basically MySQL configured with TerarkDB instead of InnoDB -- that way you can migrate your MySQL applications to Terark with virtually no modification in your code.
We do not have any benchmark against PostgreSQL though. Adapting our storage engine to PostgreSQL is not in our plans, so we have not compared against it, but we expect the gains to be just as significant.
I hope that answers your questions, and feel free to reach us at business@terark.com
Also TerichDB calls itself open source but then includes this: "TerichDB is open source but our core data structures and algorithms(dfadb) are not yet."
If the core algorithms of TerichDB are not open source then is TerichDB even usable? Are you going to open source the core algorithms?
TerichDB is an experimental repo. We'll take it private to avoid confusion. Thanks for pointing this out.
Regarding the license of our products: the core of TerarkDB is a plug-in for RocksDB. It is loaded as a dynamic library for librocksdb.so and is compliant with RocksDB’s license. All the code related to MongoDB/MySQL is open source (we use MongoRocks and MyRocks).
Making the core algorithms open source is a dilemma for us. At this stage, as a young startup, keeping the core algorithms proprietary gives us leverage on the valuation and insulates us from potential competitors. But this is something we may reconsider in the future in order to facilitate a wider adoption of our products.
It's a big debate, even for us internally. If anyone else here has faced the same dilemma, we'd love to hear your opinion, what you chose in the end and how things turned out.
The old quote from Howard Aiken may be relevant here: "Don't worry about people stealing an idea. If it's original, you will have to ram it down their throats."
In our situation, making it fully open source for wide adoption would oblige us to become a consulting company with revenue based on support. This would require a large headcount that we cannot support as a mostly bootstrapped startup.
So, for now, we decided to focus on the tech and deliver the best tech possible to the (largest) clients who need it the most. Think of it as Quality vs. Quantity. For instance, as reported in the article, our largest client at the moment is Alibaba Cloud (the Chinese Amazon AWS). We're able to cater to their custom needs and even send our CTO and some engineers to their office to accompany them when needed. We solve a pain point for them on a huge scale, and we're able to make a decent revenue that makes us profitable and allows us to grow our business independently.
But we're open on the question. Our storage engine is also compatible with MongoDB and MySQL, so if we could partner with a large company providing support for MongoDB/MySQL on Terark (think something like Percona, for instance), and if open sourcing all our code were a must, we would consider it.
I have read some of your documentation and your blog post. Your tech seems strong. But what is it about it that you feel is so unique that you cannot share it with the world? Is it the succinct trie implementation? The tree traversal algorithms? Do you have patentable tech? Will you consider applying for a patent?
I'm not second-guessing your decisions, only asking to know what it is you think you are protecting.
I really hope someone with vast knowledge of database internals will come here and comment on Terark's claims. The blog entry mentioned in the comments is a better source of information than that article.
Thank you, we really appreciate your comment. As I mentioned in another reply, we understand the claim may sound outlandish, so we try to be as transparent as possible:
- We provide several different benchmark results and detailed procedures: https://github.com/Terark/terarkdb/wiki/Benchmark
- We provide a free license of TerarkDB and you can download the exact scripts we used and run your own benchmarks with the configuration you want.
We're a bunch of geeks (I think the picture is worth a thousand words ^^) who had an itch to scratch and a lightbulb moment. We built a product around it and we're trying to make a sustainable business. Any feedback or comment is welcome.
We understand some people might be skeptical and we're happy to answer any question. And if you like it, we would be thrilled if you could help spread the word. We're not the best at marketing... haha
Some of the basic assertions, such as the relative inefficiency of block compression in database engines, are true. I've seen material gains from using context/content-aware compression and some commercial OLAP databases exploit this extensively. They appear to be using many of the same kinds of techniques.
However, the assertions made around caching behavior, such as wasting memory due to double caching, are not generally true. While you will see this in simple/naive database engines, a sophisticated high-performance database implementation won't be designed this way.
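To make the block-compression point concrete, here is a minimal sketch using Python's standard zlib module. It is not Terark's algorithm, only a generic illustration: the record layout, the block size of 100 and the shared dictionary are made-up assumptions. It contrasts reading one record out of a compressed block (the whole block gets inflated) with reading one record that was compressed on its own against a shared preset dictionary.

```python
import zlib

# Hypothetical records; the layout is an assumption made for this sketch.
records = [
    f"movie:{i:05d}|title=Example Movie {i}|year={1950 + i % 70}".encode()
    for i in range(1000)
]

# Block compression: pack BLOCK_SIZE records together and compress the block.
BLOCK_SIZE = 100
blocks = [
    zlib.compress(b"\n".join(records[i:i + BLOCK_SIZE]))
    for i in range(0, len(records), BLOCK_SIZE)
]

def read_from_block(index):
    """Return (record, bytes inflated). The whole block must be inflated."""
    block = zlib.decompress(blocks[index // BLOCK_SIZE])
    return block.split(b"\n")[index % BLOCK_SIZE], len(block)

# Per-record compression: each record is compressed alone, but against a
# shared preset dictionary so common substrings still compress well.
shared_dict = b"movie:|title=Example Movie |year=19"

def compress_record(record):
    compressor = zlib.compressobj(zdict=shared_dict)
    return compressor.compress(record) + compressor.flush()

compressed_records = [compress_record(r) for r in records]

def read_point(index):
    """Return (record, bytes inflated). Only this one record is inflated."""
    decompressor = zlib.decompressobj(zdict=shared_dict)
    data = decompressor.decompress(compressed_records[index]) + decompressor.flush()
    return data, len(data)
```

A point read through read_from_block inflates a full block to hand back one record, while read_point inflates only the record asked for. The trade-off is that per-record compression usually gives up some compression ratio unless the shared dictionary is good.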
Thanks. These assertions are here to give a basic background and overview of database performance in general. The real game changer with Terark is our novel compression algorithm. It's more space efficient, that's one thing, but above all else we can search directly in the compressed data without decompressing it. That's the real breakthrough.
We do that by using a data structure called Succinct Nested Trie, and we've introduced concepts such as CO-Index (Compressed Ordered Index) and PA-Zip (Point Accessible Zip).
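For readers unfamiliar with succinct tries, here is a rough sketch of the general idea using LOUDS, a well-known succinct tree encoding. This is an assumption-laden illustration, not Terark's Succinct Nested Trie: the nesting is not modeled, the class and function names are made up, and the rank/select helpers are plain Python lists (a real succinct structure would use compact bit vectors). What it shows is that a key lookup can walk a flat bit sequence plus a label array, without ever rebuilding the original keys.

```python
from collections import deque

def build_louds(keys):
    """Encode a trie of `keys` as LOUDS bits, edge labels and terminal flags."""
    # Step 1: build an ordinary pointer-based trie.
    root = {"kids": {}, "end": False}
    for key in keys:
        node = root
        for ch in key:
            node = node["kids"].setdefault(ch, {"kids": {}, "end": False})
        node["end"] = True
    # Step 2: visit nodes breadth-first; each node emits one 1-bit per child
    # followed by a single 0-bit. The k-th 1-bit (0-based) is node k + 1.
    bits, labels, terminal = [], [], []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        terminal.append(node["end"])
        for ch in sorted(node["kids"]):
            bits.append(1)
            labels.append(ch)
            queue.append(node["kids"][ch])
        bits.append(0)
    return bits, labels, terminal

class LoudsTrie:
    def __init__(self, keys):
        self.bits, self.labels, self.terminal = build_louds(keys)
        # Naive rank support: ones_before[p] = number of 1-bits in bits[:p].
        self.ones_before = [0]
        for bit in self.bits:
            self.ones_before.append(self.ones_before[-1] + bit)
        # Naive select support: positions of every 0-bit.
        self.zero_pos = [i for i, bit in enumerate(self.bits) if bit == 0]

    def _first_child_pos(self, node):
        # Node i's child bits start right after the i-th 0-bit.
        return 0 if node == 0 else self.zero_pos[node - 1] + 1

    def contains(self, key):
        node = 0  # root
        for ch in key:
            pos = self._first_child_pos(node)
            while self.bits[pos] == 1:
                child = self.ones_before[pos + 1]  # node number of this child
                if self.labels[child - 1] == ch:
                    node = child
                    break
                pos += 1
            else:
                return False  # ran out of children without a matching label
        return self.terminal[node]
```

For example, LoudsTrie(["tea", "team", "ten", "to"]).contains("team") walks four edges through the bit sequence and label array. Shared prefixes such as "te" are stored once, which is where the space saving in a trie comes from.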
We were at first a compression company, and turned to storage engines and databases as a domain of application for our algos, hence the analogy with Pied Piper :)
Is the compression geared towards any particular type of data? Seems like compression that would work well on, say, blog posts, may not work as well on, say, tick-level data from a stock exchange.
It works, but it's not the best scenario for us. Financial-data workloads are mostly sequential reads (gimme all data for the last 50 trading days) and write heavy (tick-level writes). We blow away the rest of the pack when you have a huge haystack and you're looking for the needle in it; that's where you're gonna get a 200x boost in performance using Terark.
That is correct. We simply built a storage engine technology. You run it yourself on your own servers. Everything is open source except for the core compression algorithms that are loaded as a proprietary dynamic library. Being a young startup, we made the choice not to open source this part for now (see my other comment higher in this thread).