PotentialX

Dramatic improvement in Time to First Token on M5 iPad

Apple’s unified memory architecture is great for LLMs (and is starting to show up in the PC world), but Apple had an Achilles’ heel: the time it took to generate the first output token from a long prompt.

This was due to the lack of an accelerated matmul instruction, which was allegedly addressed in the M5. I say allegedly because the software support to expose it wasn’t there until now.
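
To see why matmul speed matters so much for the first token in particular, here is a rough back-of-envelope sketch. It relies on the common rule of thumb that a dense model spends about 2 FLOPs per parameter per prompt token during prefill, while generating each later token is limited mostly by how fast the weights can be read from memory. The function names and every number below are hypothetical placeholders for illustration, not measurements of any Apple chip.

```python
def prefill_seconds(params, prompt_tokens, flops_per_second):
    """Rough time to first token: prompt processing is compute-bound.

    Assumes ~2 FLOPs (one multiply, one add) per parameter per prompt token
    and ignores attention cost, which grows with prompt length.
    """
    return 2 * params * prompt_tokens / flops_per_second


def decode_seconds_per_token(params, bytes_per_param, bytes_per_second):
    """Rough time per generated token: decoding is memory-bandwidth-bound,
    because every weight is read once per token."""
    return params * bytes_per_param / bytes_per_second


if __name__ == "__main__":
    # All values are made-up placeholders, not real hardware specs.
    params = 8e9            # hypothetical 8B-parameter model
    prompt_tokens = 4000    # a long prompt
    slow_matmul = 4e12      # hypothetical FLOP/s without accelerated matmul
    fast_matmul = 16e12     # hypothetical FLOP/s with accelerated matmul

    print("prefill, slow matmul:", prefill_seconds(params, prompt_tokens, slow_matmul), "s")
    print("prefill, fast matmul:", prefill_seconds(params, prompt_tokens, fast_matmul), "s")
    # 4-bit weights (0.5 bytes each) and a hypothetical 100 GB/s of bandwidth
    print("decode per token:", decode_seconds_per_token(params, 0.5, 100e9), "s")
```

In this simplified model, faster matmul shrinks the prefill time in direct proportion but leaves the per-token decode time unchanged, which is why unified memory alone helps generation speed without fixing time to first token.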