To my understanding, Tgres is really more of a "middleware" layer that collects ...

gtrubetskoy · on Aug 11, 2017

As the author if Tgres, I can chime in here - dr. mfreed is correct.

While Tgres is an application layer, the main motivation behind developing it was to answer the question "can TS be stored in Postgres efficiently, ideally without requiring an extension". Not that there is anything wrong with custom extensions, but I wanted to keep the requirements to the absolute minimum.

I always had issues with people saying that relational databases are fundamentally not suitable for TS storage, and Tgres debunks this by demonstrating that you can sustain very high rates of incoming data by simply organizing the data in a more creative way.

The graphite functionality that Tgres emulates is just there to prove the point - as in, look it does all these things, and it's all in the database.

Hypothetically Tgres could work on top of Timescaledb with a few changes, I just only have so much spare time to tinker with this experimental stuff that I haven't tried it.

Another interesting thing I came across is PgPointCloud [1], it's designed for LIDAR data but is perfectly suitable for time series as well. It is a C extension. It's performance advantage comes from storing large numbers of data points in a variety of compact/compressed formats.

[1] https://github.com/pgpointcloud/pointcloud

riku_iki · on Aug 11, 2017

It is still not very clear at this point if Tgres approach is sufficient enough, and if timescale adds significant performance advantage over native array based Postgres implementation.

mfreed · on Aug 11, 2017

I think the broader point is that Tgres is generally focused on computing regular aggregations. Once you do that aggregation, you lose significant information about the relation between data collected at the same time, which eliminates your ability to ask a variety of questions. For some basic dashboarding/monitoring applications you don't need this, for other applications you absolutely do.

So, it's pretty common in Timescale to store raw data in one hypertable (with a shorter data retention policy), and aggregations in a separate table (with a longer data retention).

Regarding native storage, the insert rate really goes down as your table gets large, e.g., Timescale gets 20x higher throughput than PG: https://blog.timescale.com/timescaledb-vs-6a696248104e

I don't see a reason this wouldn't apply to Tgres' use of native storage as well...but once you do aggregations (say, per minute), your tables are just much smaller (only 525K minutes / year), so it perhaps matters less.

riku_iki · on Aug 11, 2017

> Regarding native storage, the insert rate really goes down as your table gets large, e.g., Timescale gets 20x higher throughput than PG: https://blog.timescale.com/timescaledb-vs-6a696248104e

I actually looked at this benchmark briefly, but couldn't find what kind of PostgreSQL schema you used there. Did you use similar array based schema as was described by Tgress author in his post? https://grisha.org/blog/2015/09/23/storing-time-series-in-po...

RobAtticus · on Aug 11, 2017

We used a pretty basic approach for both: 12 columns - timestamp, hostname, 10 CPU metrics.

gtrubetskoy · on Aug 11, 2017

The not so obvious difference in the Tgres approach, and my blogs might not be doing a great job of explaining it, is that some time in Feb 2017 I significantly revamped the storage approach to what I dubbed "vertical" storage whereby a timeslot stores an array of points in which every array element represents an element of a(nother) series.

So it went from:

    series1, array[val1, val2, val3 ...] --> time direction
    series2, array[val1, val2, val3 ...]
    ...

to

    slot1, array[series1_val1, series2_val1, ...] |
    slot2, array[series1_val2, series2_val2, ...] |
                                                  ^ time dir

With this structure you can write a single row and insert data points for n series (where n is array length) in a single row insert.

Thus, if I have 10,000 series, and my arrays are 1000-long, I can insert a data point for each of the 10K series in only 10 row inserts. This only works if the data points for all series for the same slot arrive at approximately same time, which in a monitoring-like scenario they usually do, but in other situations might not be the case.

The flip side of this approach is that querying the data then becomes less efficient because to read one data point of a series you end up reading an array-length of data points you might not care about for this particular query.

Also, tgres takes the round-robin approach, versus the timed partition approach and that's completely apples and oranges when it comes to performance. The round-robin approach also works only if the data points are evenly spaced (or transformed to be evenly spaced on the fly, which is what tgres Go code does), and again, it's hard to judge whether that's fundamentally "good" or "bad".

I can see how readers of this thread my be looking for which technique is faster, but it's just not that simple, and very much depends on the what the actual requirements are. The round-robin versus timed partition is also not mutually exclusive, you can combine the two, which may or may not be faster, not sure, the devil is in the details.

riku_iki · on Aug 11, 2017

> to read one data point of a series you end up reading an array-length of data points you might not care about for this particular query.

But DBs and FSs operate on pages of data and not individual records, so you will be reading that row anyway, and likely much more.

gtrubetskoy · on Aug 11, 2017

> But DBs and FSs operate on pages of data and not individual records, so you will be reading that row anyway, and likely much more.

Yes, when dealing with database performance understanding this goes with the territory.

The "game" here is to organize data in such way that the stuff you read inadvertently is something that you will need eventually (as in in a few microseconds). This is where things like CLUSTER and BRIN indexes become important, and this is also why partitioning is a win.