Hi, Pablo here
A degraded pool with a healthy disk
Part 2 of 3 in my "First ZFS Degradation" series. See also Part 1: The Setup and Part 3: The Fix.
The "Oh Shit" Moment
I wasn't even looking for trouble. I was clicking around the Proxmox web UI, exploring some storage views I hadn't noticed before, when I saw it: my ZFS pool was in DEGRADED state.
I opened the details. One of my two mirrored drives was listed as FAULTED.
I was very surprised. The box and the disks were brand new, with less than three months of runtime on them; I wasn't expecting hardware issues to show up that fast. I SSH'd into the server and ran the command that would become my best friend over the next 24 hours:
zpool status -v proxmox-tank-1
It wasn't a UI glitch. The pool really was degraded. The drive had racked up over 100 read errors, more than 600 write errors, and 129 checksum errors. ZFS had given up on it.
NAME STATE READ WRITE CKSUM
proxmox-tank-1 DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ata-ST4000NT001-3M2101_WX11TN0Z FAULTED 108 639 129 too many errors
ata-ST4000NT001-3M2101_WX11TN2P ONLINE 0 0 0
The good news: errors: No known data errors. ZFS was serving all my data from the healthy drive. Nothing was lost yet.
The bad news: I was running on a single point of failure. If AGAPITO2, the surviving drive, decided to have a bad day too, I'd be in real trouble.
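As an aside, zpool status also has a terse mode that only reports pools with problems, handy for a quick yes/no check (nothing here is specific to my setup):

zpool status -x
# prints "all pools are healthy" when everything is fine;
# otherwise it shows the status of only the troubled pools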
I tried the classic IT move: rebooting. The system came back up and ZFS immediately started trying to resilver (rebuild) the degraded drive. But within minutes, the errors started piling up again and the resilver stalled.
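If you want to watch a resilver without retyping the status command every few seconds, wrapping it in watch does the trick (the 10-second interval is just an example):

watch -n 10 zpool status -v proxmox-tank-1
# refreshes the output every 10 seconds, so the resilver progress
# and error counters update in place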
Time to actually figure out what was wrong.
The Diagnostic Toolbox
When a ZFS drive acts up, you have two main sources of truth: what the kernel sees happening at the hardware level, and what the drive itself reports about its health. The tools for those are dmesg and smartctl, respectively.
dmesg: The Kernel's Diary
The Linux kernel maintains a ring buffer of messages about hardware events, driver activities, and system operations. The dmesg command lets you read it. For disk issues, you want to grep for SATA-related keywords:
dmesg -T | egrep -i 'ata[0-9]|sata|reset|link|i/o error' | tail -100
The -T flag gives you human-readable timestamps instead of seconds-since-boot.
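If you'd rather catch these events as they happen instead of after the fact, you can also follow the kernel log live; either of these works, pick whichever you prefer:

dmesg -wT        # follow new kernel messages, with human-readable timestamps
journalctl -kf   # same idea, via the systemd journal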
What I saw was... weird. Here's an excerpt:
[Fri Jan 2 22:25:13 2026] ata4.00: exception Emask 0x50 SAct 0x70220001 SErr 0xe0802 action 0x6 frozen
[Fri Jan 2 22:25:13 2026] ata4.00: irq_stat 0x08000000, interface fatal error
[Fri Jan 2 22:25:13 2026] ata4.00: failed command: READ FPDMA QUEUED
[Fri Jan 2 22:25:13 2026] ata4: hard resetting link
[Fri Jan 2 22:25:14 2026] ata4: SATA link down (SStatus 0 SControl 300)
Let me translate: the kernel tried to read from the drive on ata4, got a "fatal error," and responded by doing a hard reset of the SATA link. Then the link went down entirely. The drive just... disappeared.
But it didn't stay gone. A few seconds later:
[Fri Jan 2 22:25:24 2026] ata4: link is slow to respond, please be patient (ready=0)
[Fri Jan 2 22:25:24 2026] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
The drive came back! At full speed! But then...
[Fri Jan 2 22:25:29 2026] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
[Fri Jan 2 22:25:29 2026] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[Fri Jan 2 22:25:29 2026] ata4: limiting SATA link speed to 3.0 Gbps
It failed again. The kernel, trying to be helpful, dropped the link speed from 6.0 Gbps to 3.0 Gbps. Maybe a slower speed would be more stable?
It wasn't. The pattern repeated: connect, fail, reset, reconnect at a slower speed. 6.0 Gbps, then 3.0 Gbps, then 1.5 Gbps. Eventually:
[Fri Jan 2 22:27:06 2026] ata4.00: disable device
The kernel gave up entirely.
This isn't what a dying drive looks like. A dying drive throws read errors on specific bad sectors. This drive was connecting and disconnecting like someone was jiggling the cable. The kernel was calling it an "interface fatal error", emphasis on interface.
smartctl: Asking the Drive Directly
Every modern hard drive has S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) — basically a built-in health monitor. The smartctl command lets you get info out of it.
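One small wrinkle: ZFS reports drives by their /dev/disk/by-id names (the ata-ST4000NT001-... strings above), while smartctl usually gets pointed at a short /dev/sdX device. Mapping one to the other is a one-liner (the serial below is my faulted drive; swap in your own):

readlink -f /dev/disk/by-id/ata-ST4000NT001-3M2101_WX11TN0Z
# resolves to /dev/sdb on my box; yours may differ

ls -l /dev/disk/by-id/ | grep -v part
# or list every disk's by-id symlink and its /dev/sdX target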
First, the overall health check:
smartctl -H /dev/sdb
SMART overall-health self-assessment test result: PASSED
Okay, that looks great. But if the disk is healthy, what the hell is going on, and where were all those errors ZFS kept counting coming from?
Let's dig deeper with the extended info:
smartctl -x /dev/sdb
The key attributes I was looking for:
| Attribute | Value | What it means |
|---|---|---|
| Reallocated_Sector_Ct | 0 | Bad sectors the drive has swapped out. Zero is good. |
| Current_Pending_Sector | 0 | Sectors waiting to be checked. Zero is good. |
| UDMA_CRC_Error_Count | 0 | Data corruption during transfer. Zero is good. |
| Number of Hardware Resets | 39 | Times the connection has been reset. Uh... |
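smartctl -x prints pages of output; a quick grep pulls out just the lines above (the patterns match this Seagate's attribute names, so adjust them for other drives):

smartctl -x /dev/sdb | grep -E 'Reallocated_Sector|Current_Pending|UDMA_CRC|Hardware Resets'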
All the sector-level health metrics looked perfect. No bad blocks, no pending errors, no CRC errors. The drive's magnetic platters and read/write heads were fine.
But 39 hardware resets? That's not normal. That's the drive (or its connection) getting reset nearly 40 times.
I ran the short self-test to be sure:
smartctl -t short /dev/sdb
# Wait a minute...
smartctl -l selftest /dev/sdb
# 1 Short offline Completed without error 00%
The drive passed its own self-test. The platters spin, the heads move, the firmware works, and it can read its own data just fine.
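There's also a long self-test that reads the entire surface; it takes hours on a 4TB drive, so plan accordingly if you want that extra level of certainty:

smartctl -t long /dev/sdb
# check progress and results later with:
smartctl -l selftest /dev/sdb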
Hypothesis
At this point, the evidence was pointing clearly away from "the drive is dying" and toward "something is wrong with the connection."
What the kernel logs told me: the drive keeps connecting and disconnecting. Each time it reconnects, the kernel tries slower speeds. Eventually it gives up entirely. This is what you see with an unstable physical connection.
What SMART told me: the drive itself is healthy. No bad sectors, no media errors, no signs of wear. But there have been dozens of hardware resets — the connection keeps getting interrupted.
The suspects, in order of likelihood:
- SATA data cable: the most common culprit for intermittent connection issues. Cables go bad, or weren't seated properly in the first place.
- Power connection: if the drive isn't getting stable power, it might brown out intermittently.
- SATA port on the motherboard: less likely, but possible.
- PSU: power supply issues could affect the power rail feeding the drive. Unlikely, since both disks were fed from the same cable run and only one was misbehaving, but still an option.
Given that I had built this server just a few weeks earlier, and that a good part of that build happened after midnight... I was beginning to suspect that I simply might not have plugged the disk in properly.
The Verdict
I was pretty confident now: the drive was fine, but the connection was bad. Most likely the SATA data cable, and most likely just not seated properly.
The fix would require shutting down the server, opening the case, and reseating (or replacing) cables. Before doing that, I wanted to take the drive offline cleanly and document everything.
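In ZFS terms, "taking the drive offline cleanly" boils down to one command; I'll just preview it here and leave the full sequence for Part 3:

zpool offline proxmox-tank-1 ata-ST4000NT001-3M2101_WX11TN0Z
# the pool stays DEGRADED, but ZFS stops retrying the flaky device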
In Part 3, I'll walk through exactly how I fixed it: the ZFS commands, the physical work, and the validation to make sure everything was actually okay afterward.
Continue to Part 3: The Fix →