Some quick notes I wrote on the train about my experience with the World's Greatest Leader-Leader MySQL Clustering System.
Questions? Comments? Stuff I got wrong? Send me a note at email@example.com
In order to start accepting reads/writes, a Galera node has to join a cluster. You can configure Galera to start as a single node cluster, and under certain mysterious conditions that node will fail to start because it’s failing to join a cluster… with itself.
There are versions of Galera that could stop replicating DML statements while still replicating DDL statements. This should be impossible but oh boy it is not. This can produce terrible operator errors if you’re not expecting it and you use schemas matching across clusters as evidence that replication is working. AFAIK this behavior silently went away in some more recent version of Galera but no one knows why.
Related to above, when a Galera node rejoins a cluster and detects that it is out of sync with that cluster it will dump all of its data and get a fresh copy from a donor. This is *very very bad* if that node happened to have the correct data— a system can suddenly lose weeks of data, and can lose it in response to an operator trying to repair a cluster. This was much worse for Cloud Foundry because of the way it used Bosh to manage Galera automatically.
Row locks don’t replicate, but some applications use row locks as mutexes for the entity represented by the row. This can cause very bad problems if applications don’t realize they’re using Galera, two batch processors connect to different Galera nodes, and the batch job they’re running is something like “charge the bill represented by this row to a customer’s credit card.”
We wrote a proxy to attempt to force all applications to always connect to the same Galera node. The proxies node selection algorithm needed to be deterministic so flickers in network available couldn’t split the two proxies. We chose “always connect to the running node with the lowest instance ID.” This worked great… unless the node with the lowest instance ID had a lousy disk, and would crash when it first processed a read. At that point the Galera + proxy system would get stuck in a loop where Node 0 would crash, the proxies would redirect app traffic to Node 1, Node 0 would restart itself and run its data load process, come back up, the proxies would flip back to Node 0, and then immediately crash again. Once the team figured out what happened they were able to make the behavior robust to this scenario but this kind of thing is just constant if you’re trying to run Galera programmatically and you’re trying to protect yourself against its various failure modes.
There are also issues with certain kinds of queries being able to cause problems across the cluster — noisy neighbor problems basically — including one colorfully named “the Stupakov effect” but that came after my time so I don’t have command of the details.
I think it’s possible to successfully run a Galera cluster for a single application or a small set of applications that all know they’re running Galera and are basically well behaved & low traffic. It works fairly well as the database for Cloud Foundry’s control plane — the only failures I’ve ever seen involving that database were traceable to the SSO service which allowed apps on the platform to use it for login which somewhat breaks that “low traffic” rule, and even there the failure I’m thinking of was “we ran out of connections in the connection pool,” which could have happened with any database.