Terark (YC W17) is a profitable database compression company based in Beijing

rockeetterark · on June 25, 2017

Sorry, I know the article is kind of sensational but it also has some good information and we're here to discuss the real substance in the thread.

Terark built a new storage engine for Database and Data Systems based on the Succinct Nested Trie data structure. Our technology enables direct search on highly compressed data without decompressing it. Thanks to that we obtain >200X faster performance and more than 15X storage savings (better than Google's LevelDB or Facebook's RocksDB). We are a Y Combinator company (W17).

CaveTech · on June 25, 2017

That's 200x performance in relation to what?

rockeetterark · on June 25, 2017

A 200x improvement in random read performance, compared to RocksDB or WiredTiger (MongoDB's storage engine). You can find benchmarks on our website: https://terark.com/en/index and here: https://github.com/Terark/terarkdb/wiki/Benchmark

We also provide a free license of TerarkDB and you can download the exact scripts we used and run your own benchmarks with the configuration you want. We know the claim may sound outlandish, so we try to be as transparent as possible.

desdiv · on June 25, 2017

Can you please do some benchmarks against MySQL and PostgreSQL? The vast majority of your prospective customers will be using these two instead of RocksDB or MongoDB.

rockeetterark · on June 25, 2017

Sure! Our benchmark against MySQL is here: https://github.com/Terark/mysql-on-terarkdb/wiki/YCSB-on-9.1... We used YCSB on 9.1Gb of movie data. This benchmark is comparing MySQL with our product "MySQL on Terark". "MySQL on Terark" is basically MySQL configured with TerarkDB instead of InnoDB -- that way you can migrate your MySQL applications to Terark with virtually no modification in your code.

We do not have any benchmark against PostgreSQL, though. Adapting our storage engine to PostgreSQL is not in our plans, so we have not compared against it, but the gains are just as significant.

I hope that answers your questions. Feel free to reach us at business@terark.com

javiramos · on June 25, 2017

How does it compare to Pied Piper's leading compression technology? [0]

/s

[0] http://www.piedpiper.com/

rockeetterark · on June 25, 2017

The day they offer a download, I promise you we'll run a benchmark ;)

continuations · on June 25, 2017

So there's TerarkDB: https://github.com/Terark/terarkdb

And there's TerichDB: https://github.com/Terark/terichdb

How are they related to each other?

Also TerichDB calls itself open source but then includes this: "TerichDB is open source but our core data structures and algorithms(dfadb) are not yet."

If the core algorithms of TerichDB are not open source, then is TerichDB even usable? Are you going to open source the core algorithms?

All this is rather confusing.

rockeetterark · on June 25, 2017

TerichDB is an experimental repo. We'll take it private to avoid confusion. Thanks for pointing this out.

Regarding the license of our products: the core of TerarkDB is a plug-in for RocksDB. It is loaded as a dynamic library for librocksdb.so and compliant with RocksDB’s license. All the code related to MongoDB/MySQL is open source (We use MongoRocks and MyRocks).

Making the core algorithms open source is a dilemma for us. At this stage, as a young startup, keeping the core algorithms proprietary gives us leverage on the valuation and insulates us from potential competitors. But this is something we may reconsider in the future in order to facilitate a wider adoption of our products.

It's a big debate, even for us internally. If anyone else here has faced the same dilemma, we'd love to hear your opinion, what you chose in the end and how things turned out.

couchand · on June 25, 2017

The old quote from Howard Aiken may be relevant here: "Don't worry about people stealing an idea. If it's original, you will have to ram it down their throats."

rockeetterark · on June 25, 2017

In our situation, making it fully open source for wide adoption would oblige us to become a consulting company with revenue based on support. This would require a large headcount that we cannot support as a mostly bootstrapped startup. So, for now, we decided to focus on the tech and deliver the best tech possible to the (largest) clients who need it the most. Think of it as Quality vs. Quantity. For instance, as reported in the article, our largest client at the moment is Alibaba Cloud (the Chinese Amazon AWS). We're able to cater to their custom needs and even send our CTO and some engineers to their office to accompany them when needed. We solve a pain point for them on a huge scale, and we're able to make a decent revenue that makes us profitable and allows us to grow our business independently.

But we're open on the question. Our storage engine is also compatible with MongoDB and MySQL, so if we could partner with a large company providing support for MongoDB/MySQL on Terark (think something like Percona, for instance), and open sourcing all our code were a must, we would consider it.

ableton · on June 28, 2017

You guys could keep it proprietary then try to get bought out by one of the big guys who would open source it.

marcuslager · on June 27, 2017

I have read some of your documentation and your blog post. Your tech seems strong. But what is it about it that you feel is so unique that you cannot share it with the world? Is it the succinct trie implementation? The tree traversal algorithms? Do you have patentable tech? Will you consider applying for a patent?

I'm not second-guessing your decisions, only asking to know what it is you think you are protecting.

polskibus · on June 25, 2017

I really hope someone with vast knowledge of database internals will come here and comment on Terark's claims. The blog entry mentioned in the comments is a better source of information than that article.

rockeetterark · on June 25, 2017

Thank you, we really appreciate your comment. As I mentioned in another reply, we understand the claim may sound outlandish, so we try to be as transparent as possible:

- We provide several different benchmark results and detailed procedures: https://github.com/Terark/terarkdb/wiki/Benchmark
- We provide a free license of TerarkDB and you can download the exact scripts we used and run your own benchmarks with the configuration you want.

We're a bunch of geeks (I think the picture is worth a thousand words ^^) who had an itch to scratch and a lightbulb moment. We built a product around it and we're trying to make a sustainable business. Any feedback or comment is welcome. We understand some people might be skeptical and we're happy to answer any question. And if you like it, we would be thrilled if you could help spread the word. We're not the best at marketing... haha

jandrewrogers · on June 25, 2017

Some of the basic assertions, such as the relative inefficiency of block compression in database engines, are true. I've seen material gains from using context/content-aware compression and some commercial OLAP databases exploit this extensively. They appear to be using many of the same kinds of techniques.
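To make the block-compression point concrete, here is a minimal Python sketch. It is not what Terark or any commercial database actually does; the record layout, `shared_dict`, and the helper names are all made up for illustration. It only shows why a point read is expensive when records are compressed together as one block, and one generic alternative (compressing each record on its own against a shared preset dictionary, using the standard `zlib` module).

```python
import zlib

# Hypothetical records: many small, similar rows.
records = [f"user:{i:05d}|city=Beijing|plan=free".encode() for i in range(1000)]

# Block compression: all records are compressed together.
block = zlib.compress(b"\n".join(records))


def read_from_block(index):
    # A single point read has to decompress the whole block first.
    return zlib.decompress(block).split(b"\n")[index]


# Per-record compression with a shared preset dictionary (an assumption for
# this sketch): each record can be decompressed on its own.
shared_dict = b"user:00000|city=Beijing|plan=free"


def compress_record(record):
    compressor = zlib.compressobj(zdict=shared_dict)
    return compressor.compress(record) + compressor.flush()


def read_record(blob):
    decompressor = zlib.decompressobj(zdict=shared_dict)
    return decompressor.decompress(blob) + decompressor.flush()


compressed_records = [compress_record(r) for r in records]
```

Both `read_from_block(42)` and `read_record(compressed_records[42])` return the same record, but the second touches only that record's bytes. The trade-off is that per-record compression usually gives a worse ratio than one big block unless the dictionary captures the shared content well.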

However, the assertions made around caching behavior, such as wasting memory due to double caching, are not generally true. While you will see this in simple/naive database engines, a sophisticated high-performance database implementation won't be designed this way.

rockeetterark · on June 25, 2017

Thanks. These assertions are here to give a basic background and overview of database performance in general. The real game changer with Terark is our novel compression algorithm. It's more space efficient, that's one thing, but above all else we can search directly in the compressed data without decompressing it. That's the real breakthrough.

We do that by using a data structure called Succinct Nested Trie, and we've introduced concepts such as CO-Index (Compressed Ordered Index) and PA-Zip (Point Accessible Zip).
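For readers unfamiliar with succinct tries, here is a toy Python sketch of the general idea. It is not Terark's proprietary implementation and says nothing about how their nested trie, CO-Index, or PA-Zip work; it uses LOUDS, a textbook succinct tree encoding, as a stand-in. The class name `LoudsTrie` and the sample keys are made up. The trie's shape is stored as a bit sequence instead of pointers, and a lookup walks that bit sequence directly, without rebuilding the pointer-based trie.

```python
from collections import deque


class LoudsTrie:
    """Toy LOUDS-encoded trie: topology is a bit sequence, not pointers."""

    def __init__(self, keys):
        # 1. Build an ordinary pointer-based trie (only used during construction).
        root = {"children": {}, "terminal": False}
        for key in keys:
            node = root
            for ch in key:
                node = node["children"].setdefault(
                    ch, {"children": {}, "terminal": False}
                )
            node["terminal"] = True

        # 2. Breadth-first walk: emit one 1-bit per child, then a 0-bit.
        #    The k-th 1-bit (1-indexed) corresponds to the node with BFS id k.
        self.bits = []
        self.labels = []    # labels[k - 1] is the edge label leading into node k
        self.terminal = []  # terminal[v] is True if node v ends a key
        queue = deque([root])
        while queue:
            node = queue.popleft()
            self.terminal.append(node["terminal"])
            for ch in sorted(node["children"]):
                self.bits.append(1)
                self.labels.append(ch)
                queue.append(node["children"][ch])
            self.bits.append(0)

        # 3. Naive rank/select tables. Real succinct structures replace these
        #    with compact indexes; plain lists keep the sketch short.
        self.rank1 = [0]  # rank1[p] = number of 1-bits in bits[0:p]
        for bit in self.bits:
            self.rank1.append(self.rank1[-1] + bit)
        self.zero_pos = [i for i, bit in enumerate(self.bits) if bit == 0]

    def contains(self, key):
        node = 0  # BFS id of the root
        for ch in key:
            # Node v's child bits start right after the v-th 0-bit.
            pos = 0 if node == 0 else self.zero_pos[node - 1] + 1
            while self.bits[pos] == 1:
                child = self.rank1[pos + 1]  # BFS id of this child
                if self.labels[child - 1] == ch:
                    node = child
                    break
                pos += 1
            else:
                return False  # no child carries this label
        return self.terminal[node]


trie = LoudsTrie(["tea", "ten", "terark", "to"])
```

Here `trie.contains("ten")` is true and `trie.contains("te")` is false. The topology of an n-node trie takes 2n - 1 bits in this encoding, which is where the space savings of succinct structures come from.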

We were at first a compression company, and turned to storage engines and database as a domain of application for our algos, hence the analogy with Pied Piper :)

polskibus · on June 25, 2017

How does your technique compare to a typical column store?

scott00 · on June 25, 2017

Is the compression geared towards any particular type of data? Seems like compression that would work well on, say, blog posts, may not work as well on, say, tick-level data from a stock exchange.

rockeetterark · on June 25, 2017

It works, but it's not the best scenario for us. Scenarios with financial data are most likely sequential-read (gimme all data for the last 50 trading days) and write-heavy (tick-level writes). We blow away the rest of the pack when you have a huge haystack and you're looking for the needle in it; that's where you're gonna get a 200x boost in performance by using Terark.

based2 · on June 25, 2017

https://terark.com/en/blog/detail/14

rockeetterark · on June 25, 2017

Yep, this is an article we published on our blog to give a bit of background information on database performance. It's very general.

For benchmarks on TerarkDB's performance in particular, you can have a look here: https://github.com/Terark/terarkdb/wiki/Benchmark

est · on June 25, 2017

since Terark is a Chinese startup

its founders answered some more questions here

https://www.zhihu.com/question/46787984

rockeetterark · on June 25, 2017

Thanks for mentioning it! Yes, for Chinese speakers, our CTO Lei Peng has answered a lot of questions on Zhihu (the Chinese Quora).

nerdwaller · on June 25, 2017

Maybe I'm alone in privacy concerns, but something behind the "great firewall" scares me a bit to trust.

richardw · on June 25, 2017

I don't think you send them your data. They send you their technology.

rockeetterark · on June 25, 2017

That is correct. We simply built a storage engine technology. You run it yourself on your own servers. Everything is open source except for the core compression algorithms that are loaded as a proprietary dynamic library. Being a young startup, we made the choice not to open source this part for now (see my other comment higher in this thread).

tluyben2 · on June 25, 2017

US SAAS companies inspire you with confidence? Question, not troll...

omginternets · on June 25, 2017

False dichotomy. Just because you're worried about China doesn't mean you're not worried about the US.