Hi, Pablo here
Replacing a Failed Disk in a ZFS Mirror
If you've been following along, you know the story: I set up a ZFS mirror for my Proxmox VMs, then one of the drives started acting flaky, and I diagnosed and fixed what turned out to be a bad SATA connection.
Well, the connection wasn't the whole story. A few weeks after that fix, the same drive, AGAPITO1, started dropping off again. Same symptoms: link resets, speed downgrades, kernel giving up on the connection. I went through the cable swap dance again, tried different SATA ports on the motherboard, tried different cables. Nothing helped. The SATA PHY on the drive itself was failing.
I contacted PcComponentes (where I bought it), RMA'd the drive, and ran degraded on AGAPITO2 alone for about two weeks. Then the replacement arrived. This article covers the process of physically installing a new drive and getting it into the ZFS mirror, from "box on the desk" to "pool healthy, mirror whole."
The starting point
Before doing anything, this is what the pool looked like:
pool: proxmox-tank-1
state: DEGRADED
status: One or more devices have been removed.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0B in 06:55:06 with 0 errors on Tue Feb 17 20:40:50 2026
config:
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       DEGRADED     0     0     0
  mirror-0                           DEGRADED     0     0     0
    ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
errors: No known data errors
DEGRADED with one drive REMOVED. The old drive (WX11TN0Z) was physically gone, shipped back to PcComponentes. AGAPITO2 (WX11TN2P) was holding down the fort alone.
This is the beauty and the terror of a degraded mirror: everything works fine. Your VMs keep running, your data is intact, reads and writes happen normally. But you have zero redundancy. If that surviving drive has a bad day, you lose everything. Two weeks of running like this was two weeks of hoping AGAPITO2 stayed healthy.
Before you touch hardware
Before doing anything physical, I wanted to capture the current state. When things go wrong during maintenance, you want to be able to compare "before" and "after."
Three things to record while the server is still running:
Pool status, the zpool status output above. You want to know exactly what ZFS thinks the world looks like right now.
SATA layout, which drive is on which port:
dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
In my case, AGAPITO2 was on ata4 and ata3 was empty (the old drive's port). This matters because after you install the new drive, you want to confirm it shows up on the expected port.
Surviving drive health, to make sure the drive you're depending on is actually healthy before you start:
smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P
SMART overall-health self-assessment test result: PASSED
If this says anything other than PASSED, stop and deal with that first. You don't want to discover your only remaining copy of data is on a failing drive while you're in the middle of hardware work.
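To make that before/after comparison painless, here's a minimal sketch that dumps all three snapshots into dated files. The /root/disk-replacement path and file names are just arbitrary choices for illustration; adjust the pool name and disk id to yours.
# Capture the pre-maintenance state into dated files for later comparison
mkdir -p /root/disk-replacement
STAMP=$(date +%Y%m%d-%H%M%S)
zpool status -v proxmox-tank-1 > /root/disk-replacement/zpool-status-$STAMP.txt
dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up' > /root/disk-replacement/sata-layout-$STAMP.txt
smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN2P > /root/disk-replacement/smart-surviving-$STAMP.txt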
Once you've got your reference snapshots, shut down the server gracefully:
shutdown now
Physical installation
I won't write a hardware installation tutorial; every case and drive bay is different. But here are a few practical tips for homelabbers doing this for the first time:
- Inspect your cables before connecting them. If the SATA data cable has been sitting disconnected in the case, check the connector pins. Bent pins or dust can cause exactly the kind of intermittent issues that started this whole saga.
- Label the new drive. I labeled mine "TOMMY" with its serial number (WX120LHQ) written on a sticker. Yes, I name my drives. It makes debugging much easier than squinting at serial numbers.
- Push connectors until they click. Both SATA data and power. Then do the wiggle test: grab the connector gently and try to move it. If it shifts at all, it's not fully seated.
Seat the drive, connect both cables, close the case, and power on.
Boot and verify detection
First thing after boot: did the kernel see the new drive?
dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
[Fri Feb 20 22:57:06 2026] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Feb 20 22:57:06 2026] ata3.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
[Fri Feb 20 22:57:07 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Fri Feb 20 22:57:07 2026] ata4.00: ATA-11: ST4000NT001-3M2101, EN01, max UDMA/133
Both drives detected at full 6.0 Gbps: TOMMY on ata3, AGAPITO2 on ata4.
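If you want to double-check which block device landed on which port, the sysfs path for each SATA disk includes the kernel's ata port number. A quick sketch (this assumes disks attached through libata; it prints nothing useful for NVMe or USB devices):
# Map each sdX block device to its kernel ata port
for d in /sys/block/sd?; do
    printf '%s -> %s\n' "${d##*/}" "$(readlink -f "$d" | grep -Eo 'ata[0-9]+' | head -n1)"
done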
Next, verify it shows up with its expected serial in /dev/disk/by-id/:
ls -l /dev/disk/by-id/ | grep WX120LHQ
ata-ST4000NT001-3M2101_WX120LHQ -> ../../sda
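lsblk can also show model and serial next to the device names, which is a quick cross-check without digging through /dev/disk/by-id/ (the exact columns available depend on your util-linux version):
lsblk -d -o NAME,MODEL,SERIAL,SIZE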
And confirm identity with SMART:
smartctl -i /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ
Device Model: ST4000NT001-3M2101
Serial Number: WX120LHQ
Firmware Version: EN01
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Correct model, serial, firmware, and running at full speed.
One more critical check: look for SATA errors in the kernel log.
dmesg -T | grep -E 'ata[0-9]' | grep -iE 'error|fatal|reset|link down|slow|limiting'
I saw ata1: SATA link down and ata2: SATA link down, which are just unused ports. Nothing on ata3 or ata4. If you see errors on the port your new drive is on, stop. A brand new drive throwing SATA errors on a known-good cable is likely dead on arrival.
Health-check before trusting it
A drive can be detected and still be dead on arrival. Before resilvering 1.3 terabytes of data onto it, I wanted to know it was actually healthy.
SMART overall health:
smartctl -H /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ
SMART overall-health self-assessment test result: PASSED
Baseline SMART attributes, the important ones to check on a new drive:
smartctl -A /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'
5 Reallocated_Sector_Ct ... - 0
197 Current_Pending_Sector ... - 0
198 Offline_Uncorrectable ... - 0
199 UDMA_CRC_Error_Count ... - 0
All zeros. Reallocated sectors would mean the drive has already had to remap bad spots. Pending sectors are blocks the drive suspects are bad but hasn't confirmed yet. CRC errors indicate data corruption during transfer. On a new or refurbished drive, all of these should be zero.
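If you'd rather not eyeball those lines, here's a minimal sketch that fails loudly when any of the critical counters is non-zero. It assumes the standard ten-column ATA attribute table that smartctl -A prints, where the raw value starts in the tenth field:
DISK=/dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ
smartctl -A "$DISK" | awk '
    /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ {
        if ($10 != 0) { bad = 1; print "non-zero counter:", $2, $10 }
    }
    END { exit bad }'
An exit status of 1 means at least one of the counters isn't zero and the drive deserves a closer look.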
Short self-test:
smartctl -t short /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ
# Wait ~2 minutes...
smartctl -l selftest /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ
# 1 Short offline Completed without error 00% 0 -
Passed with 0 power-on hours, a fresh drive. If any of these checks fail, don't proceed. Contact the seller and get another replacement.
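Rather than guessing how long to wait, next time I'd poll the drive until the self-test execution status (reported by smartctl -c) is no longer in progress. A rough sketch:
DISK=/dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ
smartctl -t short "$DISK"
# Poll until the drive reports the self-test has finished
while smartctl -c "$DISK" | grep -qi 'in progress'; do
    sleep 30
done
smartctl -l selftest "$DISK"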
The replacement: zpool replace
This is the moment. One command:
zpool replace proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z ata-ST4000NT001-3M2101_WX120LHQ
This tells ZFS "the drive identified as WX11TN0Z (currently REMOVED) is being replaced by WX120LHQ." ZFS starts resilvering immediately, copying all data from the surviving drive (AGAPITO2) onto the new one (TOMMY).
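A side note: zpool replace also accepts the numeric vdev GUID for the device being replaced, which can be handy when the old disk no longer shows up under a readable name. zpool status -g prints those GUIDs. Something like this, with a placeholder for the GUID:
zpool status -g proxmox-tank-1
zpool replace proxmox-tank-1 <old-device-guid> ata-ST4000NT001-3M2101_WX120LHQ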
Checking status right after:
pool: proxmox-tank-1
state: DEGRADED
scan: resilver in progress since Fri Feb 20 23:10:58 2026
config:
NAME                                   STATE     READ WRITE CKSUM
proxmox-tank-1                         DEGRADED     0     0     0
  mirror-0                             DEGRADED     0     0     0
    replacing-0                        DEGRADED     0     0     0
      ata-ST4000NT001-3M2101_WX11TN0Z  REMOVED      0     0     0
      ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
    ata-ST4000NT001-3M2101_WX11TN2P    ONLINE       0     0     0
Notice the replacing-0 vdev. That's a temporary structure ZFS creates during the replacement, showing both the old (REMOVED) and new (ONLINE) drive while the resilver is in progress.
The 7.73K cksum count on the new drive might look alarming, but it's expected during a resilver. Those are blocks on the new disk that haven't been rewritten yet; ZFS knows about them and they don't indicate real corruption. The counter won't reset on its own, though, so we'll clear it once the resilver is done.
I monitored progress with:
watch -n 30 "zpool status -v proxmox-tank-1"
I also kept dmesg -Tw running in another terminal, watching for any SATA errors. The kernel log stayed quiet the entire time.
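If you want a record of how the resilver progressed rather than just a live view, a small loop like this appends the progress lines to a log file until ZFS reports the resilver is done. The log path is arbitrary, and the exact wording of zpool status output varies a bit between ZFS versions, so treat the greps as a sketch:
# Log resilver progress every 5 minutes until it finishes
while zpool status proxmox-tank-1 | grep -q 'resilver in progress'; do
    { date; zpool status proxmox-tank-1 | grep -E 'scanned|issued|resilvered|to go'; } >> /root/resilver-progress.log
    sleep 300
done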
In my case, the VMs had auto-started on boot, so the resilver was competing with production I/O. It completed in about 3.5 hours: 1.34 terabytes resilvered with 0 errors. Not bad for a pair of 4TB IronWolf drives running alongside active workloads.
Post-resilver verification
The resilver finished. Time to verify everything is actually good.
Pool status:
pool: proxmox-tank-1
state: ONLINE
scan: resilvered 1.34T in 03:32:55 with 0 errors on Sat Feb 21 02:43:53 2026
config:
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       ONLINE       0     0     0
  mirror-0                           ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0 7.73K
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
errors: No known data errors
ONLINE. The replacing-0 vdev is gone and the mirror now has the new drive in place. The 7.73K cksum on TOMMY is a residual counter from the resilver, so let's clear it:
zpool clear proxmox-tank-1
Now for the real test. A resilver copies data to rebuild the mirror, but a scrub reads every block on the pool, verifies all checksums, and repairs any mismatches. This is the definitive integrity check:
zpool scrub proxmox-tank-1
This ran for about 3.5 hours across 1.34T of data:
scan: scrub repaired 0B in 03:27:50 with 0 errors on Sat Feb 21 11:38:02 2026
NAME                                 STATE     READ WRITE CKSUM
proxmox-tank-1                       ONLINE       0     0     0
  mirror-0                           ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX120LHQ  ONLINE       0     0     0
    ata-ST4000NT001-3M2101_WX11TN2P  ONLINE       0     0     0
errors: No known data errors
Zero bytes repaired, zero errors, both drives at 0/0/0. Clean.
One last thing: a post-I/O SMART check on the new drive. After hours of heavy writes during the resilver and reads during the scrub, any hardware weakness should have surfaced:
smartctl -x /dev/disk/by-id/ata-ST4000NT001-3M2101_WX120LHQ | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC|Hardware Resets|COMRESET|Interface'
Reallocated_Sector_Ct ... 0
Current_Pending_Sector ... 0
Offline_Uncorrectable ... 0
UDMA_CRC_Error_Count ... 0
Number of Hardware Resets ... 2
Number of Interface CRC Errors ... 0
COMRESET ... 2
All clean. The 2 hardware resets and 2 COMRESETs are just from the server booting, perfectly normal.
The commands, all in one place
For future me and anyone else replacing a disk in a ZFS mirror:
# --- Before shutdown ---
# Record pool status
zpool status -v <pool>
# Record SATA layout
dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
# Check surviving drive health
smartctl -H /dev/disk/by-id/<surviving-disk-id>
# Shut down
shutdown now
# --- After boot with new drive ---
# Verify detection
dmesg -T | grep -E 'ata[0-9]+\.[0-9]+: ATA-|ata[0-9]+: SATA link up'
ls -l /dev/disk/by-id/ | grep <new-serial>
smartctl -i /dev/disk/by-id/<new-disk-id>
# Check for SATA errors
dmesg -T | grep -E 'ata[0-9]' | grep -iE 'error|fatal|reset|link down'
# Health-check the new drive
smartctl -H /dev/disk/by-id/<new-disk-id>
smartctl -A /dev/disk/by-id/<new-disk-id> | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'
smartctl -t short /dev/disk/by-id/<new-disk-id>
smartctl -l selftest /dev/disk/by-id/<new-disk-id>
# --- Replace and resilver ---
# Replace old drive with new
zpool replace <pool> <old-disk-id> <new-disk-id>
# Monitor resilver progress
watch -n 30 "zpool status -v <pool>"
# Watch kernel log for SATA errors during resilver
dmesg -Tw
# --- Post-resilver verification ---
# Check final status
zpool status -v <pool>
# Clear residual cksum counters
zpool clear <pool>
# Run a full scrub
zpool scrub <pool>
# Post-I/O SMART check
smartctl -x /dev/disk/by-id/<new-disk-id> | grep -E 'Reallocated|Pending|Offline_Uncorrect|CRC'
The mirror degradation that started on February 8th is resolved. Two weeks of running on a single drive, an RMA, and one evening of work later, the pool is whole again. Full redundancy restored, zero data lost throughout the entire saga. ZFS did exactly what it was designed to do.
This is the fourth and final article in this series. If you're just arriving, start with Part 1: Why I Put My VMs on a ZFS Mirror, then Part 2: A Degraded Pool with a Healthy Disk, and Part 3: Fixing a Degraded ZFS Mirror.