
Fixing a Degraded ZFS Mirror: Reseat, Resilver, and Scrub

Part 3 of 3 in my "First ZFS Degradation" series. See also Part 1: The Setup and Part 2: Diagnosing the Problem.

The Game Plan

By now I was pretty confident about what was wrong: not a dying drive, but a flaky SATA connection. The fix should be straightforward. Just take the drive offline, shut down, reseat the cables, bring it back up, and let ZFS heal itself.

But I wanted to do this methodically. ZFS is forgiving, but I didn't want to make things worse by rushing.

Here was my plan:

  1. Take the faulty drive offline in ZFS (tell ZFS "stop trying to use this drive")
  2. Power down the server
  3. Open the case, inspect and reseat cables
  4. Boot up, verify the drive is detected
  5. Bring the drive back online in ZFS
  6. Let the resilver complete
  7. Run a scrub to verify data integrity
  8. Check SMART one more time

Let's walk through each step.

Step 1: Taking the Drive Offline

Before touching hardware, I wanted ZFS to stop trying to use the problematic drive.

First, I set up some variables to avoid typos with that long disk ID:

DISKID="ata-ST4000NT001-3M2101_WX11TN0Z"
DISKPATH="/dev/disk/by-id/$DISKID"
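A quick sanity check that the ID resolves to a real device before touching the pool (the symlink also tells you which /dev/sdX it currently maps to):

ls -l "$DISKPATH"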

Then I took it offline:

zpool offline proxmox-tank-1 "$DISKID"

Checking the status afterward:

zpool status -v proxmox-tank-1
  NAME                                 STATE     READ WRITE CKSUM
  proxmox-tank-1                       DEGRADED     0     0     0
    mirror-0                           DEGRADED     0     0     0
      ata-ST4000NT001-3M2101_WX11TN0Z  OFFLINE    108   639   129
      ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

The state changed from FAULTED to OFFLINE. ZFS knows I intentionally took it offline rather than it failing on its own. The error counts are still there as a historical record, but ZFS isn't actively trying to use the drive anymore.
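(Small aside: zpool offline also has a -t flag for a temporary offline that reverts after the next reboot. I wanted the sticky version here, so the drive would stay out of the pool until I explicitly brought it back:)

# alternative: take the drive offline only until the next reboot
zpool offline -t proxmox-tank-1 "$DISKID"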

Time to shut down and get my hands dirty.

Step 2: Opening the Case

I powered down the server and opened up the Fractal Node 804. This case has a lovely design, with the drive bays accessible from the side: no reaching into weird corners of the case, just unscrew a couple of screws, slide the drive bay out, and there they are, handy and reachable.

I located AGAPITO1 (I had handwritten labels on the drives, lesson learned after many sessions of playing "which drive is which") and inspected the connections.
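(If you don't have labels, an easy way to work out which physical drive is which, before powering down, is to pull the serial numbers and match them against the stickers on the drives:)

lsblk -o NAME,MODEL,SERIAL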

Here's the honest truth: everything looked fine. The SATA data cable was plugged in. The power connector was plugged in. Nothing was obviously loose or damaged. There was a bit of tension in the cable as it moved from one area of the case (where the motherboard is) to the drives area, but I really didn't think that was affecting the connection to either the drive or the motherboard itself.

But "looks fine" doesn't mean "is fine". So I did a full reseat:

I made sure each connector clicked in solidly. Then I closed up the case and hit the power button.

Step 3: Verifying Detection

The server booted up. Would Linux see the drive?

ls -l /dev/disk/by-id/ | grep WX11TN0Z
lrwxrwxrwx 1 root root  9 Jan  2 23:15 ata-ST4000NT001-3M2101_WX11TN0Z -> ../../sdb

The drive was there, mapped to /dev/sdb.
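I also checked the boot log to confirm the SATA link had come back up at full speed; when everything is healthy the kernel logs a line per port along the lines of "SATA link up 6.0 Gbps":

dmesg -T | grep -i 'SATA link up'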

I opened a second terminal and started watching the kernel log in real time:

dmesg -Tw

This would show me immediately if the connection started acting flaky again. For now it was quiet: just normal boot messages and the drive being detected successfully. Nothing alarming.

Step 4: Bringing It Back Online

Moment of truth. I told ZFS to start using the drive again:

zpool online proxmox-tank-1 "$DISKID"

Immediately checked the status:

zpool status -v proxmox-tank-1
  pool: proxmox-tank-1
 state: DEGRADED
status: One or more devices is currently being resilvered.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jan  2 23:17:35 2026
        0B resilvered, 0.00% done, no estimated completion time

    NAME                                 STATE     READ WRITE CKSUM
    proxmox-tank-1                       DEGRADED     0     0     0
      mirror-0                           DEGRADED     0     0     0
        ata-ST4000NT001-3M2101_WX11TN0Z  DEGRADED     0     0     0  too many errors
        ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

Two things to notice: the drive's error counters are now at zero (we're starting fresh), and ZFS immediately started resilvering. The "too many errors" note is historical: ZFS remembers why it marked the drive bad before.

I kept watching both the status and the kernel log. No errors, no link resets.
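(For the status side, something like watch saves re-running the command by hand:)

watch -n 30 zpool status -v proxmox-tank-1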

Step 5: The Resilver

Resilvering is ZFS's term for rebuilding redundancy: copying data from the healthy drive to the one that fell behind. In my case, the drive had been desynchronized for who knows how long (the pool had drifted 524GB out of sync before I noticed), so there was a lot to copy.

I shut down my VMs to reduce I/O contention and let the resilver have the disk bandwidth.
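In my case these are Proxmox VMs, so stopping them is a couple of qm commands (the VM IDs below are just placeholders):

qm shutdown 100   # clean guest shutdown
qm shutdown 101

Progress: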

scan: resilver in progress since Fri Jan  2 23:17:35 2026
      495G / 618G scanned, 320G / 618G issued at 100M/s
      320G resilvered, 51.78% done, 00:50:12 to go

The kernel log stayed quiet the whole time. Everything was indicating the cable reseat had worked.

I went to bed and let it run overnight. The next morning:

scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan  3 00:25:33 2026

495 gigabytes resilvered in about an hour, with zero errors. But the pool still showed DEGRADED, with a warning about an "unrecoverable error." That confused me at first, but a bit of research cleared it up: ZFS is deliberately cautious and wants explicit human acknowledgement before declaring everything healthy again.

zpool clear proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z

This command clears the error flags. Immediately:

  pool: proxmox-tank-1
 state: ONLINE
  scan: resilvered 495G in 01:07:58 with 0 errors on Sat Jan  3 00:25:33 2026

    NAME                                 STATE     READ WRITE CKSUM
    proxmox-tank-1                       ONLINE       0     0     0
      mirror-0                           ONLINE       0     0     0
        ata-ST4000NT001-3M2101_WX11TN0Z  ONLINE       0     0     0
        ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

Damn, seeing this felt nice.

Step 6: The Scrub

A resilver copies data to bring the drives back in sync, but it doesn't verify that all the existing data is still good. For that, you run a scrub. ZFS reads every block on the pool, verifies checksums, and repairs anything that doesn't match.

zpool scrub proxmox-tank-1

I let this run while I brought my VMs back up (scrubs can run in the background without blocking normal operations, though performance takes a hit).
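(If the performance hit ever became a problem, OpenZFS can pause a scrub and resume it later; as far as I know it's simply:)

zpool scrub -p proxmox-tank-1   # pause the running scrub
zpool scrub proxmox-tank-1      # run again later to resume where it left off

A few hours later: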

scan: scrub repaired 13.0M in 02:14:22 with 0 errors on Sat Jan  3 11:03:54 2026

    NAME                                 STATE     READ WRITE CKSUM
    proxmox-tank-1                       ONLINE       0     0     0
      mirror-0                           ONLINE       0     0     0
        ata-ST4000NT001-3M2101_WX11TN0Z  ONLINE       0     0   992
        ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0

Interesting. The scrub repaired 13MB of data and found 992 checksum mismatches on AGAPITO1. From what I read, checksum errors are typically a sign of the disk being in terrible shape and needing a replacement ASAP. That sounds scary, but I took the risk and assumed those were blocks that had been written incorrectly (or not at all) during the period when the connection was flaky, and not an issue with the disk itself. ZFS detected the bad checksums and healed them using the good copies from AGAPITO2.
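Worth knowing for next time: if a scrub finds damage it can't repair, zpool status -v lists the affected files in an "errors:" section at the bottom, so you know exactly what to restore from backup. A quick way to check just that part:

zpool status -v proxmox-tank-1 | grep -A 10 '^errors:'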

I cleared the errors again and the pool was clean:

zpool clear proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z

Step 7: Final Validation with SMART

One more check. I wanted to see if SMART had anything new to say about the drive after all that activity:

smartctl -x /dev/sdb | egrep -i 'overall|Reallocated|Pending|CRC|Hardware Resets'
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
0x06  0x008  4              41  ---  Number of Hardware Resets

Still passing. The hardware reset count went from 39 to 41 — just the reboots I did during this process.

For completeness, I ran the long self-test. The short test only takes a minute and does basic checks; the long test actually reads every sector on the disk, which for a 4TB drive takes... a while.

smartctl -t long /dev/sdb

The estimated time was about 6 hours. In practice, it took closer to 12. Running VMs in parallel probably didn't help.
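While it runs, smartctl can report how far along the test is; the self-test execution status line shows the percentage remaining:

smartctl -c /dev/sdb | grep -A 1 'Self-test execution status'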

But eventually:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1563         -
# 2  Short offline       Completed without error       00%      1551         -
# 3  Short offline       Completed without error       00%      1462         -

The extended test passed. Every sector on the disk is readable. The drive is genuinely healthy — it was just the connection that was bad.

Lessons Learned

I'm happy I was able to test out recovering from a faulty disk on such a tiny issue. I learned a lot fixing it, and I'm now even happier than before that I decided to go for this ZFS pool setup.

Quick Reference: The Commands

For future me (and anyone else who ends up here with a degraded pool):

# Check pool status
zpool status -v <pool>

# Watch kernel logs in real time
dmesg -Tw

# Check SMART health
smartctl -H /dev/sdX
smartctl -x /dev/sdX

# Take a drive offline before physical work
zpool offline <pool> <device>

# Bring a drive back online
zpool online <pool> <device>

# Clear error flags after recovery
zpool clear <pool> <device>

# Run a scrub to verify all data
zpool scrub <pool>

# Run SMART self-tests
smartctl -t short /dev/sdX  # Quick test (~1 min)
smartctl -t long /dev/sdX   # Full surface scan (hours)
smartctl -l selftest /dev/sdX  # Check test results

Thanks for reading! This was Part 3: The Fix. You might also enjoy Part 1: The Setup and Part 2: Diagnosing the Problem.
