Ranting about the recent x3D failures

NOTE: this is going to be very technical and this is my OWN opinion, so fact-check anything after this point yourself.
NOTE2: this is an evolving situation and I don't have the hardware to test and probe things myself, but what I do have is experience with overclocking on past platforms and a decent understanding of the hardware and firmware side of things.
NOTE3: a new GN video came out but I have had no time to watch it; this post predates it!

PLEASE WATCH THE GAMERS NEXUS VIDEO BEFORE READING THIS

the silicon perspective

The silicon failures are the hardest part to explain. The current theory is that the silicon gets damaged by current leakage between Vcore and VSOC, and I think it's the theory that makes the most sense. This would imply that AMD does have some involvement in the failures we've been seeing, but not in the way you might expect.

the root cause

The real issue is quite obvious to anyone who has actually looked into how modern boards handle RAM support, or anyone with RAM overclocking experience on DDR4 and DDR5 platforms. The modern way to handle wacky XMPs and unfortunate silicon lottery luck is to run the RAM controller and DDR PHY subsystems at voltages higher than spec. On Intel platforms the main voltages for this are VCCIO and VCCSA; on AMD, VSOC is used instead. As you can imagine this isn't great, especially when the board has no real way to estimate how much voltage is needed to stabilise a memory kit, because the biggest factors are hard to detect properly.

The DRAM IC revision is easily the biggest factor in determining how hard it will be for the memory controller to read and write memory at a given frequency, but not all silicon is the same, and some ICs are known to have a huge variance in silicon quality between samples. This SHOULD at least partially be accounted for in newer DDR revisions, because the SPD specification now has programmable bytes used to define the DRAM IC vendor and IC revision. The bad news is that DRAM sellers (the ones who sell the sticks to the user) only program the IC vendor, and sometimes not even the correct one.
Just like DRAM ICs, memory controllers are silicon, which implies there will be variance between CPUs. Don't forget that the memory controller clock is tied to the memory clock, and not all memory controllers will run the same clock range. This isn't even tested in QA: CPU vendors only certify JEDEC speeds, which means there's no easy way to determine the silicon quality of memory controllers or PHYs.
But that's not all! DDR5 runs at such high speeds that even motherboard batches have started to have an impact on RAM stability; some users found that specific motherboard batches were simply better at clocking memory than others. This could be detected in a number of ways and accounted for, but it would make firmware development a nightmare, as it increases the number of boards to test firmware updates on.
As you can imagine, the firmware doesn't really consider these variables. The approach is to calculate a generous voltage relative to the target DRAM frequency, and if RAM training fails, to try training again with an even higher voltage, eventually downclocking memory or giving up completely.
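To make that retry loop a bit more concrete, here's a rough Python sketch of the logic as I described it. This is NOT real firmware code: every name and every number in it (base voltage, step sizes, retry count, the `try_training` callback) is a placeholder I made up for illustration, and real boards will differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Every constant here is a made-up placeholder, not a real firmware value.
JEDEC_MTS = 4800            # assumed baseline speed the board falls back towards
BASE_VSOC = 1.05            # hypothetical starting voltage in volts
VOLTS_PER_1000_MTS = 0.05   # hypothetical extra voltage per 1000 MT/s above baseline
RETRY_STEP = 0.05           # hypothetical voltage bump after a failed training run
MAX_RETRIES = 3             # hypothetical number of bumps before downclocking
DOWNCLOCK_STEP = 400        # hypothetical speed reduction in MT/s


@dataclass
class TrainingResult:
    speed_mts: int
    vsoc: float


def initial_vsoc(target_mts: int) -> float:
    """Pick a 'generous' voltage from the target frequency alone."""
    extra = max(0, target_mts - JEDEC_MTS) / 1000 * VOLTS_PER_1000_MTS
    return round(BASE_VSOC + extra, 3)


def train_memory(
    target_mts: int,
    try_training: Callable[[int, float], bool],
) -> Optional[TrainingResult]:
    """Retry with more voltage, then downclock, then give up (returns None)."""
    speed = target_mts
    while speed >= JEDEC_MTS:
        vsoc = initial_vsoc(speed)
        for _ in range(MAX_RETRIES + 1):
            if try_training(speed, vsoc):
                return TrainingResult(speed, vsoc)
            # Training failed: raise the voltage and try again.
            # Nothing here looks at the DRAM IC, the CPU or the board batch.
            vsoc = round(vsoc + RETRY_STEP, 3)
        speed -= DOWNCLOCK_STEP
    return None
```

The thing to notice is what's missing: the only input is the target frequency, so a kit that's hard to train just gets more and more voltage thrown at it until something works or the loop runs out of options.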

the motherboard perspective

Until now motherboard manufacturers got away with it. Motherboards violating specifications or running dangerous voltages are nothing new, so much so that in RAM overclocking circles automatic voltages are considered a meme. Similar issues have happened in the past, but they were always isolated to those who ran high-speed memory, and they only showed up after a longer span of time. In the past the dynamic was somewhat different because memory controllers were good enough to run most speed bins, training times were shorter and, most importantly, the failure point was different (the VCCIO rail runs the iGPU on Intel, meaning problems would happen as soon as the iGPU saw a load, generally with QuickSync). Now it seems that the failure point is the bias between Vcore and VSOC, which would also explain why the failures are more common on lower core count x3D chips, as they run a lower Vcore.

the OCP not triggering on ASUS boards

I've seen a lot of concern about this, and understandably so. OCP is a way to prevent a fault from escalating, but please understand that OCP will not save your CPU; it will just prevent the death of other components. ASUS was always known for being generous with OCP, and while I do agree it is dangerous, I don't think it plays a big role in the current situation.

my conclusions

Expandability is great for the end user, but most people don't consider just how much annoyance it can really create. From this unfortunate situation we learned that what might seem like complete incompetence at first can turn out to be a much more complex situation. I personally think that AMD's biggest mistake was not enforcing the guidelines properly, and that the current approach to ensuring RAM compatibility kind of sucks. Sadly, as far as I can tell, every possible solution will inevitably anger someone. Here are some solutions I could think of and what their downsides are:

Selling CPUs with RAM soldered to the substrate might be the best idea for speed and compatibility but it would cause outrage from consumers.
Removing EXPO and XMP support from motherboards might be a great idea for compatibility and it would let consumers keep the upgradability, but I think this just wouldn't happen any time soon. CPU vendors might already not support non-JEDEC DRAM, but most people don't even know this is the case, and I think it would leave so much performance on the table.
Implementing mechanisms to better predict voltages is certainly possible and it would alleviate the catastrophic faults but it would extend training times even more and possibly reduce compatibility compared to what we have today.
Ignoring the problems with the current approach and keeping things like they are today might seem like the least radical option, but with DDR5 it's getting more and more unsustainable. RAM training is now a multi-minute process, CPU+RAM compatibility is lower than ever and motherboards are running dangerous amounts of voltage to even get to where we are today.

I personally believe that most people running the same CPU model have the same RAM capacity anyway, and that upgradability in that sense has died down in the last decade, but I don't have enough data to confirm or deny this. It would also anger consumers; movements like right to repair in particular wouldn't be happy about this approach, maybe even foolishly so, without considering how many failures it could prevent.

PLEASE! Read and watch these too, and if you are interested, research the topics discussed here elsewhere.
