what actually is an npu. like the m2 in my mac mini has one but like what is it. what kind of operations do they do
@kirakira the prediction I made turned out to be correct so here's my take on it
An NPU is supposed to be an "AI accelerator"
... Except of course the Comp Sci behind that is an evolving field, so they can't make a GPT ASIC for chatbots or a DLSS ASIC for graphics. Those algorithms could be obsolete in who knows how short a time
So all they can do is think about the various operations that all of these algorithms share. And the reason they were moved to GPUs was bc those incidentally process massive matrices
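To make that concrete, the "massive matrices" op is basically just this (a plain-Python sketch of the idea, nothing vendor-specific):

```python
def matmul(a, b):
    # naive matrix multiply: the multiply-accumulate workhorse behind
    # both graphics transforms and neural-net layers
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

# GPUs (and NPUs) just do millions of these multiply-accumulates in parallel
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```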
@kirakira so what if we wanted to optimize a chip *specifically for* a bunch of that stuff, without all the graphics part of the GPU
-- after all, with things like Tesla from NVIDIA, which had no video output, and "GPGPU" as a whitepaper term, things were already going in that direction somewhat
So we can do what we were planning, except for Attention Models we also add a hardware SoftMax operator
Ship it, and Bob Gets a Bonus ™️
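For reference, the softmax that gets baked into hardware is just this little function (plain-Python sketch, not any particular vendor's op):

```python
import math

def softmax(xs):
    # turn arbitrary scores into a probability distribution;
    # subtracting the max first keeps exp() from overflowing
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

Attention layers run this over every row of a (potentially huge) score matrix, which is why a fixed-function unit for it pays off.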
@kirakira in fact, this is why I am slightly contrarian about the NPU
I still think it has great potential uses outside of "AI" bullshit. Matrices come up in all sorts of system modeling problems, and random PCs being better at that might have some benefits
And Intel was sorta encouraging that? Until they did those staff cuts, at least
https://github.com/intel/intel-npu-acceleration-library
(I have no proof, but archived April 2025? A very coincidental date...)
But anyway, that is my ramble answering your question
@kirakira generally, without referring to what that specific npu does, an npu usually accelerates a few specific matrix operations up to a certain size, and common activation functions or convolutions applied to the same variable-sized 2d matrix.
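the "common activation functions" part usually means something as dumb as this (hypothetical sketch, applied elementwise in one fixed-function pass):

```python
def relu(x):
    # rectified linear unit: probably the most common activation
    # an NPU hardwires; just a clamp at zero
    return x if x > 0.0 else 0.0

print([relu(v) for v in [-1.5, 0.0, 2.5]])  # [0.0, 0.0, 2.5]
```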
if this sounds a lot more generic than you expected, that’s because it is
@kirakira@furry.engineer there's a few of them at the lower end that, perhaps also surprisingly, can be programmed using embedded opencl or vulkan compute shaders rather than being a pile of black box firmware that inputs a compiled tensorflowlite blob file
@kirakira@furry.engineer you can coerce things like a blur shader or conway's game of life to something that tensorflow-lite devices will accept, and they are perfectly good at doing both of those. or for that matter, anything that resembles a convolutional kernel. a lot of stuff that has traditionally been done on DSPs in embedded devices can be done just as well on a small NPU, the programming model just looks rather different
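to show what "resembles a convolutional kernel" means, here's the shared core of both a box blur and conway's neighbour count (plain-python sketch, not the actual tflite api):

```python
def conv2d(img, kernel):
    # naive "valid" 2D convolution: slide the kernel over the image and
    # multiply-accumulate at each position; this is the op NPUs accelerate
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(img[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

# a 3x3 box blur is a uniform kernel...
blur = [[1 / 9] * 3 for _ in range(3)]
# ...and conway's neighbour count is the same op with a different kernel
life = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

grid = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]
print(conv2d(grid, life))  # [[4]] -> the centre cell has 4 live neighbours
```

same machinery, different kernel; the game-of-life step is then just a lookup table on the neighbour count, which is exactly the kind of thing an activation stage can do.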
the main distinguishing thing for a lot of them is supporting, sometimes exclusively, only small, low-precision operations - you won’t find NPUs doing 64-bit floats, and even 32-bit ones are going to be hard to come across. they’re going to be things like bfloat16, int8, int16, maybe int32
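what targeting those small types looks like in practice (a hypothetical symmetric-int8 sketch; real toolchains pick scales per tensor or per channel, and often add zero-points):

```python
def quantize_int8(xs, scale):
    # map reals to int8: real ≈ q * scale, with q clamped to [-128, 127]
    return [max(-128, min(127, round(x / scale))) for x in xs]

def dequantize(qs, scale):
    # approximate recovery of the original values
    return [q * scale for q in qs]

weights = [0.51, -1.2, 0.03]
scale = 1.2 / 127  # chosen so the largest magnitude just fits in int8
q = quantize_int8(weights, scale)
# dequantize(q, scale) lands within half a quantization step of the originals
```

you trade a little precision for weights that are 4x smaller than float32 and feed a much cheaper integer multiply-accumulate array.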