Description
This is a somewhat vague report, but we hit an issue when a node was put back into production after a disk failure and could not recover from its peer replicas. We put the node back into production and it started receiving parts, but the connection always broke before a part transfer finished. There was a lot of network and disk I/O, yet nothing completed successfully, so disk usage did not grow and the instance never recovered. Another strange thing: when we stopped the faulty instance, the peer replicas continued to read and transmit parts even though the receiver was down, so I had to restart the peer replicas as well to stop the disk I/O.
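A minimal sketch of how one could confirm the peers are still busy with transfers, assuming the standard system.metrics table and its ReplicatedSend/ReplicatedFetch counters:

```sql
-- Current number of part fetches/sends in flight on this replica.
-- Run on the sender replicas: a non-zero ReplicatedSend while the
-- receiver is down suggests transfers that are not being torn down.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('ReplicatedSend', 'ReplicatedFetch');
```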
We did two things to mitigate the issue:
- Switch from xfs to ext4. The xfs nodes experienced strange performance degradation under load: overall load went up 8x and disk I/O almost stalled. It was also impossible to recover from replicas on xfs due to the connection breaks, while recovery progressed from nodes on ext4.
- Change the affected node's configuration to only two replicas per shard, so it is forced to recover from a single replica. This works even with xfs nodes, but it makes recovery tedious (a query to sanity-check the replica count and recovery backlog is sketched after this list).
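For the second mitigation, a rough way to check how many replicas a table actually sees and how much recovery work is queued, assuming the stock system.replicas table, is something like:

```sql
-- Replica count and recovery backlog per replicated table.
-- total_replicas / active_replicas come from ZooKeeper; queue_size is the
-- number of pending replication queue entries (fetches, merges, ...).
SELECT database, table, total_replicas, active_replicas, queue_size, absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;
```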
I suspect the core of the issue is that the receiving replica cannot moderate the senders when it is unable to keep up, which leads to the connection breaks; there is currently no way to throttle throughput or apply any sort of backpressure, the sender replicas just keep sending. I'm not sure how you are dealing with this, but it also affects read query performance while replication is in progress. The secondary issue is that xfs does not seem to be a good filesystem choice here; we're still investigating what causes that.
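From the receiving side, the broken transfers should show up as GET_PART entries that keep retrying. A sketch of a diagnostic query, assuming the standard system.replication_queue table:

```sql
-- Pending part fetches on the recovering replica. Repeated connection
-- breaks show up as a growing num_tries with last_exception populated,
-- while postpone_reason explains entries the scheduler is holding back.
SELECT database, table, new_part_name, source_replica,
       num_tries, last_exception, postpone_reason
FROM system.replication_queue
WHERE type = 'GET_PART'
ORDER BY num_tries DESC
LIMIT 20;
```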