Hi, Pablo here
Fixing a Degraded ZFS Mirror: Reseat, Resilver, and Scrub
Part 3 of 3 in my "First ZFS Degradation" series. See also Part 1: The Setup and Part 2: Diagnosing the Problem.
The Game Plan
By now I was pretty confident about what was wrong: not a dying drive, but a flaky SATA connection. The fix should be straightforward. Just take the drive offline, shut down, reseat the cables, bring it back up, and let ZFS heal itself.
But I wanted to do this methodically. ZFS is forgiving, but I didn't want to make things worse by rushing.
Here was my plan:
- Take the faulty drive offline in ZFS (tell ZFS "stop trying to use this drive")
- Power down the server
- Open the case, inspect and reseat cables
- Boot up, verify the drive is detected
- Bring the drive back online in ZFS
- Let the resilver complete
- Run a scrub to verify data integrity
- Check SMART one more time
Let's walk through each step.
Step 1: Taking the Drive Offline
Before touching hardware, I wanted ZFS to stop trying to use the problematic drive.
First, I set up some variables to avoid typos with that long disk ID:
DISKID="ata-ST4000NT001-3M2101_WX11TN0Z"
DISKPATH="/dev/disk/by-id/$DISKID"
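Not strictly necessary, but it doesn't hurt to confirm the path actually resolves before pointing zpool at it:
# Should show the symlink pointing at the current /dev/sdX node
ls -l "$DISKPATH"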
Then I took it offline:
zpool offline proxmox-tank-1 "$DISKID"
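Side note: if you only need the device offline until the next reboot, zpool offline has a temporary flag for that. I didn't use it here, but for a quick cable check it saves the manual online step later:
# Temporary offline: the device reverts to its previous state after a reboot
zpool offline -t proxmox-tank-1 "$DISKID"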
Checking the status afterward:
zpool status -v proxmox-tank-1
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       DEGRADED     0     0     0
  mirror-0                           DEGRADED     0     0     0
    ata-ST4000NT001-3M2101_WX11TN0Z  OFFLINE    108   639   129
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
The state changed from FAULTED to OFFLINE. ZFS knows I intentionally took it offline rather than it failing on its own. The error counts are still there as a historical record, but ZFS isn't actively trying to use the drive anymore.
Time to shut down and get my hands dirty.
Step 2: Opening the Case
I powered down the server and opened up the Fractal Node 804. This case has a lovely design: the drive bays are accessible from the side, so there's no reaching into weird corners of the case. Just unscrew a couple of screws, slide the drive bay out, and there they are, handy and reachable.
I located AGAPITO1 (I had handwritten labels on the drives, lesson learned after many sessions of playing "which drive is which") and inspected the connections.
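(If you haven't labeled your drives yet, you can at least match serial numbers before powering down; the serial is the last chunk of the by-id name. Something like this lists model and serial for every physical disk:)
# Model and serial number for each disk, no partitions
lsblk -d -o NAME,MODEL,SERIAL,SIZE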
Here's the honest truth: everything looked fine. The SATA data cable was plugged in. The power connector was plugged in. Nothing was obviously loose or damaged. There was a bit of tension in the data cable where it ran from the motherboard compartment over to the drive compartment, but I really didn't think that was affecting the connection at either the drive or the motherboard end.
But "looks fine" doesn't mean "is fine". So I did a full reseat:
- Unplugged and firmly replugged the SATA data cable at both ends (drive and motherboard).
- Unplugged and firmly replugged the power connector.
- While I was in there, checked the connections on the other disk of the mirror as well.
I made sure each connector clicked in solidly. Then I closed up the case and hit the power button.
Step 3: Verifying Detection
The server booted up. Would Linux see the drive?
ls -l /dev/disk/by-id/ | grep WX11TN0Z
lrwxrwxrwx 1 root root 9 Jan 2 23:15 ata-ST4000NT001-3M2101_WX11TN0Z -> ../../sdb
The drive was there, mapped to /dev/sdb.
I opened a second terminal and started watching the kernel log in real time:
dmesg -Tw
This would show me immediately if the connection started acting flaky again. For now, it was quiet, showing just normal boot messages, the drive being detected successfully, etc. Nothing alarming.
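If the log had been noisy, I would have filtered it down to the SATA layer. Something along these lines (the patterns are just my guess at what's useful) keeps only the ATA link chatter:
# Follow the kernel log, keeping only ATA/SATA link messages
dmesg -Tw | grep -iE --line-buffered 'ata[0-9]|sata link|hard reset'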
Step 4: Bringing It Back Online
Moment of truth. I told ZFS to start using the drive again:
zpool online proxmox-tank-1 "$DISKID"
Immediately checked the status:
zpool status -v proxmox-tank-1
pool: proxmox-tank-1
state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri Jan 2 23:17:35 2026
0B resilvered, 0.00% done, no estimated completion time
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       DEGRADED     0     0     0
  mirror-0                           DEGRADED     0     0     0
    ata-ST4000NT001-3M2101_WX11TN0Z  DEGRADED     0     0     0  too many errors
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
Two things to notice: the drive's error counters are back to zero (we're starting fresh), and ZFS immediately started resilvering. It still shows "too many errors" as the reason for the degraded state, but that's historical: ZFS remembers why the drive was marked bad before.
I kept watching both the status and the kernel log. No errors, no link resets.
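"Watching" just means re-running zpool status every so often; a watch loop in one terminal, with the dmesg follow from earlier in the other, does the job. Something like:
# Refresh the pool status every 30 seconds
watch -n 30 'zpool status -v proxmox-tank-1'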
Step 5: The Resilver
Resilvering is ZFS's term for rebuilding redundancy: copying data from the healthy drive to the one that fell behind. In my case, the drive had been desynchronized for who knows how long (the pool had drifted 524GB out of sync before I noticed), so there was a lot to copy.
I shut down my VMs to reduce I/O contention and let the resilver have the disk bandwidth. Progress:
scan: resilver in progress since Fri Jan 2 23:17:35 2026
495G / 618G scanned, 320G / 618G issued at 100M/s
320G resilvered, 51.78% done, 00:50:12 to go
The kernel log stayed quiet the whole time. Everything indicated the cable reseat had worked.
I went to bed and let it run overnight. The next morning:
scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan 3 00:25:33 2026
495 gigabytes resilvered in about an hour, with zero errors. But the pool still showed DEGRADED, along with a warning about an "unrecoverable error." That confused me at first, but a bit of research cleared it up: ZFS is deliberately cautious and wants a human to acknowledge the errors before it declares everything healthy again.
zpool clear proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z
This command clears the error flags. Immediately:
pool: proxmox-tank-1
state: ONLINE
scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan 3 00:25:33 2026
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       ONLINE       0     0     0
  mirror-0                           ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX11TN0Z  ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
Damn, seeing this felt nice.
Step 6: The Scrub
A resilver copies data to bring the drives back in sync, but it doesn't verify that all the existing data is still good. For that, you run a scrub. ZFS reads every block on the pool, verifies checksums, and repairs anything that doesn't match.
zpool scrub proxmox-tank-1
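One thing worth knowing (I didn't need it here, but it's good to have in the back pocket): a running scrub can be paused and resumed if its I/O impact becomes a problem:
# Pause a running scrub...
zpool scrub -p proxmox-tank-1
# ...and resume it later by issuing scrub again
zpool scrub proxmox-tank-1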
I let this run while I brought my VMs back up (scrubs can run in the background without blocking normal operations, though performance takes a hit). A few hours later:
scan: scrub repaired 13.0M in 02:14:22 with 0 errors on Sat Jan 3 11:03:54 2026
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       ONLINE       0     0     0
  mirror-0                           ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX11TN0Z  ONLINE       0     0   992
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
Interesting. The scrub repaired 13MB of data and found 992 checksum mismatches on AGAPITO1. From what I read, checksum errors are often treated as a sign that a disk is in terrible shape and needs replacing ASAP. That sounds scary, but I took the risk and assumed those were blocks that had been written incorrectly (or not at all) while the connection was flaky, not a problem with the disk itself. ZFS detected the bad checksums and healed them using the good copies from AGAPITO2.
I cleared the errors again and the pool was clean:
zpool clear proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z
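Worth mentioning: the ZFS packages on Proxmox/Debian usually ship a periodic scrub job out of the box, but if yours doesn't, a plain cron entry covers it. A rough sketch, to be adjusted to taste:
# /etc/cron.d/zfs-scrub (sketch): scrub the pool at 02:00 on the 1st of every month
0 2 1 * * root /usr/sbin/zpool scrub proxmox-tank-1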
Step 7: Final Validation with SMART
One more check. I wanted to see if SMART had anything new to say about the drive after all that activity:
smartctl -x /dev/sdb | egrep -i 'overall|Reallocated|Pending|CRC|Hardware Resets'
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
0x06  0x008  4              41  ---  Number of Hardware Resets
Still passing. The hardware reset count went from 39 to 41 — just the reboots I did during this process.
For completeness, I ran the long self-test. The short test only takes a minute and does basic checks; the long test actually reads every sector on the disk, which for a 4TB drive takes... a while.
smartctl -t long /dev/sdb
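While the test runs you can peek at its progress; the remaining percentage shows up in the drive's capabilities section, something like:
# Shows "Self-test routine in progress" plus the percentage remaining
smartctl -c /dev/sdb | grep -A1 'Self-test execution'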
The estimated time was about 6 hours. In practice, it took closer to 12. Running VMs in parallel probably didn't help.
But eventually:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       1563        -
# 2  Short offline       Completed without error       00%       1551        -
# 3  Short offline       Completed without error       00%       1462        -
The extended test passed. Every sector on the disk is readable. The drive is genuinely healthy — it was just the connection that was bad.
Lessons Learned
- ZFS did exactly what it's supposed to do: Despite 524+ gigabytes of desync and nearly a thousand checksum errors, I lost zero data and was back in action while keeping my VMs running. The healthy drive kept serving everything while the flaky drive was acting up, and once the connection was fixed, ZFS healed itself automatically. I had also been operating with effectively a single drive for an unknown amount of time. In this case it turned out to be me messing up cable management, but it's reassuring to know that if the disk had been genuinely faulty, services would have kept running just fine.
- Physical connections matter: It's easy not to pay much attention to cabling when building a new box. Well, it bites back.
- Monitor your pools: I only found this issue by accident, clicking around in the Proxmox UI. The pool had been degraded for who knows how long before I noticed. I'm already working on setting up a monitor in my Uptime Kuma instance so that the next time the pool status stops being ONLINE I get notified immediately; a rough sketch of what that check could look like is below.
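For the curious, here's roughly the check I have in mind: a small script run from cron that pings an Uptime Kuma push monitor only while every pool is healthy (the push URL below is a placeholder, and the whole thing is a sketch rather than a finished monitor). zpool status -x prints "all pools are healthy" when there's nothing wrong, so if the pool degrades the pings stop and Uptime Kuma flags the monitor as down.
#!/bin/sh
# Sketch: heartbeat an Uptime Kuma push monitor only when all ZFS pools are healthy.
# Replace KUMA_PUSH_URL with your own push monitor URL.
KUMA_PUSH_URL="https://uptime.example.com/api/push/XXXXXXXX"
if zpool status -x | grep -q "all pools are healthy"; then
    curl -fsS --retry 3 "$KUMA_PUSH_URL" > /dev/null
fi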
I'm happy I got to practice recovering from a faulty disk on such a tiny issue. I learned a lot fixing it, and I'm now even happier than before with my decision to go for this ZFS pool setup.
Quick Reference: The Commands
For future me (and anyone else who ends up here with a degraded pool):
# Check pool status
zpool status -v
# Watch kernel logs in real time
dmesg -Tw
# Check SMART health
smartctl -H /dev/sdX
smartctl -x /dev/sdX
# Take a drive offline before physical work
zpool offline <pool> <device>
# Bring a drive back online
zpool online <pool> <device>
# Clear error flags after recovery
zpool clear <pool> [device]
# Run a scrub to verify all data
zpool scrub <pool>
# Run SMART self-tests
smartctl -t short /dev/sdX # Quick test (~1 min)
smartctl -t long /dev/sdX # Full surface scan (hours)
smartctl -l selftest /dev/sdX # Check test results
Thanks for reading! This was Part 3: The Fix. You might also enjoy Part 1: The Setup and Part 2: Diagnosing the Problem.