
FWIW we went through a very similar process to the one documented here by GitHub (~3 months ago). It was entirely for operational reasons and nothing to do with shortcomings in Redis itself. MySQL was the master record for 99% of our data while Redis was the master record for the other 1% (as it happens, it was also a kind of activity stream). Having a single 'master' reference for our data reduced complexity to a degree that made it worth running a less computationally efficient setup. We also have nowhere near GitHub's volume, so we did not have to do such significant re-architecting to make unification possible.

Now we still use Redis for reading the activity streams and as an LRU cache for all sorts of data, but it is populated, like all of our specialised slave-read systems (Elasticsearch, etc.), by replicating from the MySQL log.

Hope that helps!



Yes, this helps and totally makes sense to me, thanks. I would do the same. In this case, however, it looks like there were certain high-volume writes that could be handled in a simpler manner with Redis. But it is totally possible that, while this looks like an important use case, it accounted for a small percentage of all the data, so we are back to consolidation: moving everything to a single system, which is in general a good idea.


What method are you using to replicate from the MySQL binlog to the various other systems?


FWIW, I've used github.com/siddontang/go-mysql to successfully replicate from MySQL to DynamoDB. We're currently not using GTIDs and are looking into that next.
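For anyone curious, here's a minimal sketch of what tailing the binlog with go-mysql's replication package looks like, roughly following the project's readme. The host, credentials, and starting position are placeholders, and the actual DynamoDB writes are elided:

    package main

    import (
        "context"
        "fmt"
        "os"

        "github.com/siddontang/go-mysql/mysql"
        "github.com/siddontang/go-mysql/replication"
    )

    func main() {
        // Placeholder connection details; point these at the MySQL master.
        cfg := replication.BinlogSyncerConfig{
            ServerID: 100, // must be unique among all replicas/consumers
            Flavor:   "mysql",
            Host:     "127.0.0.1",
            Port:     3306,
            User:     "repl",
            Password: "secret",
        }
        syncer := replication.NewBinlogSyncer(cfg)

        // The file/position would normally come from a saved checkpoint
        // (or a GTID set, once GTID-based replication is in place).
        streamer, err := syncer.StartSync(mysql.Position{Name: "mysql-bin.000001", Pos: 4})
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        for {
            ev, err := streamer.GetEvent(context.Background())
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
            }
            // A real consumer would translate row events into writes
            // against the downstream store (DynamoDB in our case) and
            // checkpoint the position after each applied event.
            ev.Dump(os.Stdout)
        }
    }

In practice you'd persist the binlog position (or GTID set) after each applied event so the process can resume where it left off.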


Just asking for some info, but how do you make sure that your multiple DB systems stay in sync (specifically interested in MySQL and Elasticsearch)?

Hope it's alright to ask you that.


In the case of ES, the short answer is: we don't. We have fault tolerance in our replication system to guarantee eventual consistency instead. I would say using ES as a consistent source of data isn't really playing to its strengths, so we don't use it that way. The consistency you want is determined at read time: if you need consistency, hit MySQL, but for our use case that almost never happens, as eventual consistency is usually instantaneous enough.

Our other tool is to decouple lookup (which objects to fetch) from population (what data to return for each object). You can mix and match, e.g. do a lookup against an inconsistent ES but still get consistent objects by populating from MySQL (or vice versa). As others have alluded to, it depends entirely on the requirements for the result set.
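To make that concrete, here's a rough sketch of the lookup/population split in Go. Everything here (index name, field, table, DSN) is made up for illustration; ES is queried with _source disabled so it only ever returns IDs, and the objects themselves come from MySQL:

    package main

    import (
        "database/sql"
        "encoding/json"
        "fmt"
        "net/http"
        "strings"

        _ "github.com/go-sql-driver/mysql"
    )

    // esLookup asks Elasticsearch only for matching document IDs
    // ("_source": false), treating it purely as an index.
    func esLookup(term string) ([]string, error) {
        body := fmt.Sprintf(`{"_source": false, "query": {"match": {"text": %q}}}`, term)
        resp, err := http.Post("http://localhost:9200/activities/_search",
            "application/json", strings.NewReader(body))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var result struct {
            Hits struct {
                Hits []struct {
                    ID string `json:"_id"`
                } `json:"hits"`
            } `json:"hits"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
            return nil, err
        }
        ids := make([]string, 0, len(result.Hits.Hits))
        for _, h := range result.Hits.Hits {
            ids = append(ids, h.ID)
        }
        return ids, nil
    }

    // populate fetches the consistent copy of each object from MySQL.
    func populate(db *sql.DB, ids []string) (*sql.Rows, error) {
        if len(ids) == 0 {
            return nil, nil // nothing matched in ES
        }
        placeholders := strings.TrimRight(strings.Repeat("?,", len(ids)), ",")
        args := make([]interface{}, len(ids))
        for i, id := range ids {
            args[i] = id
        }
        return db.Query("SELECT id, body FROM activities WHERE id IN ("+placeholders+")", args...)
    }

    func main() {
        db, err := sql.Open("mysql", "app:secret@tcp(mysql-host:3306)/app")
        if err != nil {
            panic(err)
        }
        ids, err := esLookup("deployed")
        if err != nil {
            panic(err)
        }
        rows, err := populate(db, ids)
        if err != nil || rows == nil {
            return
        }
        defer rows.Close()
    }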


Where I work, we use several different MySQL replicas in production and don't expect them to be in sync.

So long as the source of truth (the master MySQL node) is up to date, it's okay.

For example, if we show a user how much money is in their account on every page, we can run that query on a replica, since it's fine if the value is a few seconds delayed. However, immediately after an action that changed their balance, on a confirmation screen, we'd want to show the value from the master.
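A sketch of what that routing looks like in code (Go with database/sql; the DSNs, table, and helper name are all invented):

    package main

    import (
        "database/sql"

        _ "github.com/go-sql-driver/mysql"
    )

    // One handle for the source of truth, one for a replica that may lag
    // by a few seconds. DSNs are placeholders.
    var master, replica *sql.DB

    // balanceDB returns the master only when the caller has just written
    // (e.g. a confirmation screen right after a transfer); otherwise a
    // slightly stale read from a replica is acceptable.
    func balanceDB(justWrote bool) *sql.DB {
        if justWrote {
            return master
        }
        return replica
    }

    func main() {
        var err error
        if master, err = sql.Open("mysql", "app:secret@tcp(master-host:3306)/bank"); err != nil {
            panic(err)
        }
        if replica, err = sql.Open("mysql", "app:secret@tcp(replica-host:3306)/bank"); err != nil {
            panic(err)
        }

        var balance int64
        // Ordinary page render: the replica is fine.
        _ = balanceDB(false).QueryRow(
            "SELECT balance FROM accounts WHERE id = ?", 42).Scan(&balance)
        // Right after the user changed their balance: read your own write.
        _ = balanceDB(true).QueryRow(
            "SELECT balance FROM accounts WHERE id = ?", 42).Scan(&balance)
    }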

It's entirely possible that any place Elasticsearch is being used just doesn't need consistency.


There are actually a few strong solutions out there for MySQL, most starting with change data capture, e.g. https://github.com/shyiko/mysql-binlog-connector-java (I link that one in particular because it links to alternatives right in its readme!)

Postgres is a bit harder, but if I needed to start somewhere, it would be with:

https://github.com/debezium/debezium

or https://github.com/confluentinc/bottledwater-pg

These are the starting points for pretty sophisticated solutions, for when you need near-real-time Elasticsearch indexes and can bring up infrastructure like Kafka.

For many applications, queueing an update whenever something hits your ORM, combined with an hourly/daily full refresh, is pretty satisfactory.
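A minimal sketch of that simpler pattern, with an in-process channel standing in for whatever queue you actually use, and the ORM hook and ES calls stubbed out:

    package main

    import (
        "fmt"
        "time"
    )

    // pending is an in-process stand-in for whatever queue you actually
    // use (Sidekiq, SQS, a plain DB table...). Row IDs needing a reindex
    // land here.
    var pending = make(chan int64, 1024)

    // afterSave is the hypothetical ORM hook: enqueue, don't index inline.
    func afterSave(id int64) {
        select {
        case pending <- id:
        default:
            // Queue full: drop and let the periodic refresh catch it.
        }
    }

    // indexWorker drains the queue and pushes documents to Elasticsearch.
    func indexWorker() {
        for id := range pending {
            fmt.Println("reindexing row", id) // ES index call would go here
        }
    }

    // fullRefresh is the hourly/daily safety net that re-walks the table
    // and reindexes everything, catching whatever the queue missed.
    func fullRefresh() {
        for range time.Tick(time.Hour) {
            fmt.Println("full reindex sweep") // SELECT ... and bulk index
        }
    }

    func main() {
        go indexWorker()
        go fullRefresh()
        afterSave(42) // simulate a model save
        time.Sleep(time.Second)
    }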


If you need any kind of consistency guarantee, I think you would have to use some form of distributed transaction.

If you don't, you could tail the MySQL log and have a process make the same changes to Elasticsearch. Elasticsearch may lag behind if there are problems.
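For the tailing approach, go-mysql (mentioned upthread) also ships a higher-level canal package that decodes row events for you. A sketch, with the master address as a placeholder and the Elasticsearch write stubbed out:

    package main

    import (
        "fmt"

        "github.com/siddontang/go-mysql/canal"
    )

    // esApplier receives row events decoded from the binlog and would
    // mirror them into Elasticsearch (the HTTP call is stubbed out).
    type esApplier struct {
        canal.DummyEventHandler // no-op implementations for other events
    }

    func (h *esApplier) OnRow(e *canal.RowsEvent) error {
        // e.Action is insert/update/delete; e.Rows holds column values.
        // A real applier would turn this into an ES index/delete request
        // and retry on failure, which is exactly where lag creeps in
        // when there are problems downstream.
        fmt.Println(e.Action, e.Table.Name, e.Rows)
        return nil
    }

    func main() {
        cfg := canal.NewDefaultConfig()
        cfg.Addr = "127.0.0.1:3306" // placeholder master address
        cfg.User = "repl"
        cfg.Password = "secret"

        c, err := canal.NewCanal(cfg)
        if err != nil {
            panic(err)
        }
        c.SetEventHandler(&esApplier{})

        // Run() streams from the current master position; a real process
        // would checkpoint and resume instead.
        if err := c.Run(); err != nil {
            panic(err)
        }
    }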


I'm facing a similar challenge, although at a much (MUCH!) smaller scale.

We have nearly everything in Postgres, and Redis serves both as a caching layer (non-persistent) and as the store for Rails sessions and Sidekiq (persistent).

Having one source of truth can make things like failover much easier. I can handle PG failover, and Redis failover too, but I'd rather not have to deal with both. Especially if you consider the potential for things going slightly out of sync (think of a job in Sidekiq that relies on an ID in PG, where one of the two loses a few microseconds of data during replication; just speculating a scenario here).

Did anybody face similar challenges and care to share their thoughts?



