MCE hardware error finally resolved
My KDE Neon system has crashed intermittently for the past 2 years without a clear pattern or trigger. The only hint was that it often (not always) happened while I was using graphics-intensive software, such as a video player or remote desktop. But it also happened during a cold boot, or when the monitors woke up from sleep. It was mostly unpredictable.
What made the diagnosis confusing is that the kernel log mostly shows errors like this:
mce: [Hardware Error]: CPU 19: Machine Check: 0 Bank 5: bea0000001000108
mce: [Hardware Error]: TSC 0 ADDR 7f0d1a36c7e5 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1689465218 SOCKET 0 APIC 7 microcode a201025
This basically hints at the CPU being unstable and crashing. I have tried:
- Voltage bias (to simulate Windows settings), disabling PBO and Overdrive;
- Disable D.O.C.P (XMP profile), running my RAM at low frequency;
- Updating mesa and kernel firmware
Nothing worked.
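For reference, the status word in the first log line can be unpacked a little. Below is a minimal Python sketch that pulls the CPU, bank and status out of such a line and names the top flag bits. It assumes the line format shown in my log and covers only the architectural x86 MCA flag bits 63 to 57. The names `parse_mce_line`, `LINE_RE` and `STATUS_FLAGS` are my own, not part of any tool.

```python
import re

# Architectural x86 MCA status flag bits (bit position -> short name).
# Only the top flags are listed; vendor-specific fields are ignored here.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Assumed format, taken from the first line of the log excerpt above.
LINE_RE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def parse_mce_line(line):
    """Return CPU, bank, raw status and set flags, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    status = int(match.group("status"), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        "status": status,
        "flags": flags,
    }


if __name__ == "__main__":
    sample = (
        "mce: [Hardware Error]: CPU 19: Machine Check: 0 "
        "Bank 5: bea0000001000108"
    )
    print(parse_mce_line(sample))
```

For my log line this reports CPU 19, bank 5, and the flags VAL, UC, EN, MISCV, ADDRV and PCC, i.e. an uncorrected error with corrupted processor context. That matches a hard crash, but the flags only describe severity, not which component is actually at fault.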
Recently, after another fatal crash, I investigated the matter further and came across these 2 pages, where a group of people have been debugging this issue together for almost 2 years:
- https://bbs.archlinux.org/viewtopic.php?id=264997
- https://gitlab.freedesktop.org/drm/amd/-/issues/1481
The majority of users described the same issue that I had experienced, and they have made some amazing discoveries.
They found this comment from AMD's Richard T:
bea0000000000108 means the thread has stopped executing…this is longest timeout, all other hardware fault timers would/should fire before this. […] this case has lots of possible causes…OS, App, voltage , temp, board hardware(power delivery cases), memory (are you running ECC memory ?)
And a genius named Leonardo Gates commented:
This issue has been prevalent across all the RX 5000 series cards it seems (I've seen this on the 5500XT/5600XT/5700/5700XT) as on Windows, it will trigger a WHEA-18 error (Cache-Hierarchy error) while on Linux it gives this MCE, bea0000000000108. I highly suspect this is either a hardware errata or due to faulty hardware (as even people I've spoken to that did RMAs, still got the error). I'm genuinely wondering if there will ever be a fix for this since it's been almost 2 years and this is pretty much the only problem I have with my GPU.
In other words, the root problem was probably my GPU, the RX 5700XT itself, and had nothing to do with my CPU, RAM or motherboard.
And the solution was to upgrade to a different GPU, such as the RX 6000 or 7000 series.
So I picked up an RX 7600 today. Let's see how long this solution holds up.