One of the big questions coming out of AMD's CES announcements was whether its new CPU design, codenamed Matisse, would support making one of its chiplets a graphics chiplet in order to build an APU. Matisse places up to two chiplets and an IO die on a single package. In our discussions with AMD, we received confirmation that this will not be the case.

The new Matisse design is the platform for AMD's next generation of desktop processors. The layout shown at CES this year had a single IO die of about 122.6 mm², built on GlobalFoundries 14nm. It was paired with a chiplet die of about 80.8 mm², containing eight cores and built on TSMC's 7nm. There is obviously space on that package for another CPU chiplet, and there have always been questions about whether the chiplet design could accommodate a graphics chiplet.

AMD stated that, at this time, there will be no version of the current Matisse chiplet layout in which one of the chiplets is graphics. We were told that there will be Zen 2 processors with integrated graphics, presumably arriving well after the desktop processors, but built on a different design. APUs are typically mobile-first and lower-cost parts, so AMD will have to make different design decisions to serve that market.

Our contacts at AMD also discussed the TDP range of the upcoming Matisse processors. AMD defines TDP in terms of the cooling performance required of the CPU cooler. By that definition, Matisse will span the same range of TDPs as the current Ryzen 2000-series processors. This means we could see 'E' variants as low as 35W TDP, all the way up to the top 'X' processors at 105W, similar to the current Ryzen 7 2700X. We were told that the company expects the processors to fit within that range. This is to be expected to some extent, given that Matisse will be backwards compatible with current AM4 motherboards via a BIOS update.

The conclusion was that, in a NUMA environment, the Windows scheduler assigns a 'best NUMA node' to each piece of software. It is programmed to move that software's threads to that node as often as possible, and it will kick out other threads that have the same 'best NUMA node' setting with abandon. When a single binary spawns 32 or 64 threads, every one of those threads is assigned the same 'best NUMA node'. As a result, the threads are continually pushed onto that node, displacing threads that are already there. This leads to core contention, and a fully multi-threaded program could spend half of its time shuffling threads around to comply with the 'best NUMA node' policy.
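To put a number on that contention, here is a minimal toy model. It is not a description of how the Windows scheduler is actually implemented. It assumes a hypothetical four-node system with 16 hardware threads per node, roughly the shape of a 32-core, quad-die Threadripper. It also assumes that every thread from one process is steered to a single preferred node. The function names `place_threads_best_node` and `oversubscription` are illustrative only.

```python
from collections import Counter


def place_threads_best_node(num_threads: int, best_node: int) -> Counter:
    """Toy model: every thread of one process is steered to the same 'best' node."""
    return Counter({best_node: num_threads})


def oversubscription(placement: Counter, hw_threads_per_node: int) -> dict:
    """Runnable threads per hardware thread on each node (above 1.0 means contention)."""
    return {node: count / hw_threads_per_node for node, count in placement.items()}


# Hypothetical: 64 threads, all preferring node 0, on nodes with 16 hardware threads each.
placement = place_threads_best_node(num_threads=64, best_node=0)
print(oversubscription(placement, hw_threads_per_node=16))  # {0: 4.0}
```

Under these assumptions, one node is asked to run four times as many threads as it has hardware threads, while the other three nodes sit idle.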

One would expect this issue to come up in any NUMA environment, such as dual-socket systems or dual-die AMD processors. It turns out that Microsoft has a hotfix in place in Windows that disables this 'best NUMA node' behavior in dual-NUMA environments. Presumably, at some point there were enough dual-socket workstation platforms on the market for this to make sense. That left the 'best NUMA node' behavior active only in environments with three or more NUMA nodes. This is why we see the issue on quad-die Threadripper and EPYC, but not on dual-die Threadripper.

Wendell has been working with Jeremy from Bitsum, creator of the Coreprio software, to develop a way to soft-fix this issue. Coreprio now has an option called 'NUMA Disassociator'. It checks which software is active every few seconds and adjusts thread affinity while the software is running, rather than applying an affinity mask, which has no effect.
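To illustrate the general idea of breaking the 'one shared best node' pattern, the sketch below spreads a process's threads across NUMA nodes round-robin. It is a hypothetical placement policy and not Coreprio's actual algorithm, which is not described here. It also only computes an assignment. A real tool would have to apply it through an OS interface, such as the Win32 `SetThreadAffinityMask` or `SetThreadGroupAffinity` calls, and re-apply it periodically as threads come and go. The function name `spread_round_robin` is illustrative only.

```python
from collections import Counter


def spread_round_robin(thread_ids, num_nodes):
    """Hypothetical policy: give each thread the next node in turn, instead of
    steering every thread of the process to one shared 'best' node."""
    return {tid: i % num_nodes for i, tid in enumerate(thread_ids)}


# Hypothetical: 64 threads from one process on a 4-node system.
assignment = spread_round_robin(list(range(64)), num_nodes=4)
print(sorted(Counter(assignment.values()).items()))
# [(0, 16), (1, 16), (2, 16), (3, 16)]
```

With each node receiving 16 threads, no node is oversubscribed in this toy setup. A real fix would also need to consider memory locality, which round-robin placement ignores.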

AMD stated that it has support and update tickets open with Microsoft's Windows team on the issue. The company believes it knows what the issue is, and commended Wendell for getting very close to the actual cause, though it declined to go into detail. AMD is currently comparing notes with Bitsum. It also helped Bitsum develop the original tool for affinity masking, although the 'NUMA Disassociator' is new.

The timeline for a fix will depend on a number of factors between AMD and Microsoft. However, there will be announcements when the fix is ready, detailing exactly how it will affect performance. Other performance optimizations will also be included. AMD remains very pleased with Threadripper 2 performance. The company is keen to stress that in the most popular performance tests, reviews show its rendering performance to be well ahead of the competition. It is also working with software vendors to push that performance even further.