Description
This is a somewhat vague report, but we hit an issue when a node was put back into production after a disk failure and could not recover from its peer replicas. We put the node back into production and it started receiving parts, but the connection always broke before a part transfer finished. There was a lot of network and disk I/O, yet nothing completed successfully, so disk usage did not grow and the instance never recovered. Another strange thing: when we stopped the faulty instance, the peer replicas continued to read and transmit parts even though the receiver was down, so I had to restart the peer replicas as well to stop the disk I/O.
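A minimal sketch of how one could confirm the peers are still busy with transfers, assuming the standard system.metrics table and its ReplicatedSend/ReplicatedFetch counters:

```sql
-- Current number of part fetches/sends in flight on this replica.
-- Run on the sender replicas: a non-zero ReplicatedSend while the
-- receiver is down suggests transfers that are not being torn down.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('ReplicatedSend', 'ReplicatedFetch');
```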
We did two things to mitigate the issue:
- Switch from xfs to ext4. The xfs nodes experienced strange performance degradation under load: overall load went up 8x and disk I/O almost stalled. It was also impossible to recover from replicas on xfs due to the connection breaks, while recovery progressed from nodes on ext4.
- Change the affected node's configuration to only two replicas per shard, so it is forced to recover from a single replica. This works even with xfs nodes, but it makes recovery tedious (a query to sanity-check the replica count and recovery backlog is sketched after this list).
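For the second mitigation, a rough way to check how many replicas a table actually sees and how much recovery work is queued, assuming the stock system.replicas table, is something like:

```sql
-- Replica count and recovery backlog per replicated table.
-- total_replicas / active_replicas come from ZooKeeper; queue_size is the
-- number of pending replication queue entries (fetches, merges, ...).
SELECT database, table, total_replicas, active_replicas, queue_size, absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;
```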
I suspect the core of the issue is that the receiving replica cannot moderate the senders when it is unable to keep up, which leads to the connection breaks; there is currently no way to throttle throughput or apply any sort of backpressure, the sender replicas just keep sending. I'm not sure how you are dealing with this, but it also affects read query performance while replication is in progress. The secondary issue is that xfs does not seem to be a good filesystem choice here; we're still investigating what causes that.
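From the receiving side, the broken transfers should show up as GET_PART entries that keep retrying. A sketch of a diagnostic query, assuming the standard system.replication_queue table:

```sql
-- Pending part fetches on the recovering replica. Repeated connection
-- breaks show up as a growing num_tries with last_exception populated,
-- while postpone_reason explains entries the scheduler is holding back.
SELECT database, table, new_part_name, source_replica,
       num_tries, last_exception, postpone_reason
FROM system.replication_queue
WHERE type = 'GET_PART'
ORDER BY num_tries DESC
LIMIT 20;
```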