Ranting about the recent x3D failures

NOTE: this is going to be very technical and this is my OWN opinion. Fact-check anything after this point yourself.
NOTE2: this is an evolving situation and I don't have the hardware to test and probe things myself. What I do have is experience with overclocking on past platforms and a decent understanding of the hardware and firmware side of things.
NOTE3: a new GN video came out but I have had no time to watch it. This post predates it!

PLEASE WATCH THE GAMERS NEXUS VIDEO BEFORE READING THIS

the silicon perspective

The silicon failures are the hardest part to explain. The current theory is that the silicon gets damaged by current leakage between Vcore and VSOC, and I think it's the theory that makes the most sense. This would imply that AMD does have some involvement in the failures we've been seeing, but not in the way you might expect.

the root cause

The real issue is quite obvious to anyone who has actually looked into how modern boards handle RAM support, or anyone with RAM overclocking experience on DDR4 and DDR5 platforms. The modern way to handle wacky XMPs and unfortunate silicon lottery luck is to run the RAM controller and DDR PHY subsystems at voltages higher than spec. On Intel platforms the main voltages for this are VCCIO and VCCSA; on AMD, VSOC is used instead. As you can imagine this isn't great, especially when the board has no real way to estimate how much voltage is needed to stabilise a memory kit, because the biggest factors are hard to detect properly.
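To make the idea of a hard limit on automatic voltages a bit more concrete, here's a minimal Python sketch of a clamp that a board's auto-voltage rule could be passed through. Everything in it is an assumption for illustration: the function name, the 1.30 V ceiling and the example requests are hypothetical and not taken from any board's firmware or from AMD documentation.

```python
# Hypothetical sketch: clamp an auto-selected VSOC to a fixed ceiling.
# VSOC_CEILING_V is an assumed value, used for illustration only.
VSOC_CEILING_V = 1.30


def clamp_vsoc(requested_v: float, ceiling_v: float = VSOC_CEILING_V) -> float:
    """Return the voltage that would actually be applied after the clamp."""
    if requested_v <= 0:
        raise ValueError("voltage must be positive")
    return min(requested_v, ceiling_v)


if __name__ == "__main__":
    # Made-up requests, e.g. what an auto rule might pick for different kits.
    for requested in (1.10, 1.25, 1.40):
        applied = clamp_vsoc(requested)
        note = " (clamped)" if applied < requested else ""
        print(f"requested {requested:.2f} V -> applied {applied:.2f} V{note}")
```

The point is only that the ceiling is applied no matter what the auto rule asks for. A kit that needs more than the ceiling to be stable would simply fail to train instead of being fed a dangerous voltage.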

the motherboard perspective

Until now motherboard manufacturers got away with it. Motherboards violating specifications or running dangerous voltages are nothing new, so much so that in RAM overclocking circles automatic voltages are considered a meme. Similar issues have happened in the past, but they were always isolated to those who ran high-speed memory, and they showed up after a longer span of time. In the past the dynamic was somewhat different because memory controllers were good enough to run most speed bins, training times were shorter and, most importantly, the failure point was different (the VCCIO rail runs the iGPU on Intel, meaning problems would happen as soon as the iGPU saw a load, generally with QuickSync). Now it seems that the failure point is the bias between Vcore and VSOC, which would also explain why the failures are more common on lower core count x3D chips, as they run a lower Vcore.
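The Vcore argument is just arithmetic, and a tiny Python sketch shows it. The voltages below are made-up numbers picked only to illustrate the reasoning, not measurements from any chip or board, and `rail_bias` is a hypothetical helper name.

```python
# Hypothetical numbers only: they illustrate the arithmetic, not real readings.
def rail_bias(vsoc_v: float, vcore_v: float) -> float:
    """Difference between the SoC rail and the core rail, in volts."""
    return round(vsoc_v - vcore_v, 3)


if __name__ == "__main__":
    vsoc = 1.35  # assumed elevated automatic VSOC
    chips = (("higher-Vcore chip", 1.25), ("lower-Vcore chip", 1.05))
    for label, vcore in chips:
        print(f"{label}: bias = {rail_bias(vsoc, vcore):+.2f} V")
```

With the same elevated VSOC, the chip that runs the lower Vcore ends up with the larger difference between the two rails, which is the situation the leakage theory points at.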

the OCP not triggering on ASUS boards

I've seen a lot of concern about this, and understandably so. OCP is a way to prevent a fault from escalating, but please understand that OCP will not save your CPU; it will just prevent the death of other components. ASUS has always been known for being generous with OCP limits, and while I do agree it is dangerous, I don't think this plays a big role in the current situation.

my conclusions

Expandability is great for the end user, but most people don't consider just how much annoyance it can really create. From this unfortunate situation we learned that what might seem like complete incompetence at first can turn out to be a much more complex situation. I personally think that AMD's biggest mistake was not enforcing the guidelines properly, and that the current approach to ensuring RAM compatibility kind of sucks. Sadly, as far as I can tell, every possible solution will inevitably anger someone. Here are some solutions I could think of and what their downsides are:

I personally believe that most people running the same CPU model have the same RAM capacity anyway, and that upgradability in that sense has died down in the last decade, but I don't have enough data to confirm or deny this. It would also anger consumers: movements like right to repair especially wouldn't be happy about this approach, maybe even foolishly not considering how many failures it could prevent.

PLEASE read and watch these too, and if you are interested, research the topics discussed here elsewhere as well.
