Rendered at 17:24:45 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
lioeters 6 hours ago [-]
What an article, richly illustrated with diagrams, going deep into technical details to squeeze 4x performance improvement. The knowledge and experience that took to write this is worth gold.
It might benefit the world if there was philanthropic funding for people like this to do more public research and writing. Imagine there's so much information and wisdom in some people's brains, that deserve the chance to be written down and released in the public domain.
peterabbitcook 45 minutes ago [-]
Really cool visuals, thank you for sharing this.
I feel like the importance of padding is a bit understated on this page - BLAS and LAPACK require LDA and LDB parameters, and you can definitely tune these to the page size of a particular system/machine to improve performance.
When working with BLAS/LAPACK or other matrix libs, you can often apply a little linear algebra to reshape the problem rather than the input data to avoid a transpose altogether
Const-me 3 hours ago [-]
The AVX2 SIMD version is not ideal. Too many instructions, and it needs constant vectors. I would rather do it like that https://godbolt.org/z/cn6YKbfYd
asplake 10 hours ago [-]
Is transposition a common enough operation that it might be better to avoid it by having versions of the operations/functions that take matrices that do the necessary transpositions implicitly?
rhdunn 9 hours ago [-]
IIRC, libraries like numpy and pytorch can already do that as they store the matrices as 1D arrays with information on things like the stride length (advancing to the next row). That allows you to implement operations like transposition by editing the stride length and other parameters without manipulating the content of the matrix array.
threatripper 10 hours ago [-]
This is already done as much as possible by reordering and merging operations but transposition (explicit or implicit) is unavoidable for some operations.
yccs27 6 hours ago [-]
A good example for this is A + A^T; you can fuse the two operations but you cannot get around the access pattern of matrix transposition.
usernametaken29 6 hours ago [-]
The thing about linear algebra that strikes me as funny is that I faintly remember every instructor I ever have met says it is easy. Just learn the matrix cookbook and you’ll be good. At last, when you actually implement it, the ideas are simple in theory but in practice you’ll run into quadratic runtime almost all the time and it’s really hard to do well on CPUs. So is linear algebra hard? Maybe not on paper but in reality it’s really fucking difficult to get it to do what you want precisely and quickly
bsoles 4 hours ago [-]
The theory of (basic) linear algebra is pretty easy. The practice of proper linear algebra is very difficult. People spend entire careers on things like error propagation in matrix operations. There are entire books on inverting matrices using a litany of approaches using things like QR décomposition, SVD, etc. I think it is fascinating...
contravariant 5 hours ago [-]
I wonder if using a hilbert curve (or perhaps the simpler z-curve) access order would help things. If ought to work well regardless of cache size.
Actually I think I recall some GPUs storing textures that way, but I'm not entirely sure.
4 hours ago [-]
amelius 7 hours ago [-]
That last diagram almost looks like an FFT shuffle.
yccs27 5 hours ago [-]
It's the same topology! Just replace each shuffle/blend with multiplication by a root of unity and addition, and you get FFT!
amelius 7 hours ago [-]
Are there programming languages or optimizers that simplify this kind of plumbing?
mlochbaum 5 hours ago [-]
This is one of the best uses I've found for Singeli[0]. Here's how I implemented an AVX2 transpose kernel similar to that in transpose_Vec256_kernel, for generic type and vector/kernel size (I use unpack instructions rather than shuffle and blend, which I think is probably faster since it's just one instruction for each interaction):
The language is oriented towards compile-time array programming instead of managing a bunch of individual vectors. So you have runtime vec_select{} (docs at [1]), mirrored by compile-time select{}, and the indices generated by pairs{} can be used in either.
Does it use switch in place? x = x XOR y, y = x XOR y, x = x XOR y
adrian_b 41 minutes ago [-]
Switch in place is inefficient for data stored in memory, like a matrix.
The most efficient way to swap 2 values stored in memory is to use 2 load instructions and 2 store instructions, like "load X in R1; load Y in R2; store R1 in Y; store R2 in X".
Therefore, for swapping memory values the XOR trick has never been useful, in the entire history of automatic computers.
For swapping data that is stored in internal CPU vector registers or matrix registers there are special shuffle instructions, which implement various kinds of transpositions.
Switch in place was efficient mostly in the distant past, for swapping general-purpose registers in those CPUs that did not have dedicated exchange/swap instructions. Intel/AMD CPUs always had exchange instructions, so switch in place has never been useful on them in any circumstances, since the very launch of the IBM PC, 45 years ago.
Today, the XOR trick might have remained useful for swapping general-purpose registers in some microcontrollers, but in the most popular ISAs, like ARM-based or RISC-V, most GPRs are equivalent, so the need to swap them arises very rarely, only in certain kinds of loops, and even there swapping can frequently be avoided by unrolling the loops.
It might benefit the world if there was philanthropic funding for people like this to do more public research and writing. Imagine there's so much information and wisdom in some people's brains, that deserve the chance to be written down and released in the public domain.
I feel like the importance of padding is a bit understated on this page - BLAS and LAPACK require LDA and LDB parameters, and you can definitely tune these to the page size of a particular system/machine to improve performance.
When working with BLAS/LAPACK or other matrix libs, you can often apply a little linear algebra to reshape the problem rather than the input data to avoid a transpose altogether
Actually I think I recall some GPUs storing textures that way, but I'm not entirely sure.
https://github.com/dzaima/CBQN/blob/v0.11.0/src/singeli/src/...
The language is oriented towards compile-time array programming instead of managing a bunch of individual vectors. So you have runtime vec_select{} (docs at [1]), mirrored by compile-time select{}, and the indices generated by pairs{} can be used in either.
[0] https://github.com/mlochbaum/Singeli/
[1] https://github.com/mlochbaum/Singeli/tree/master/include#sim...
The most efficient way to swap 2 values stored in memory is to use 2 load instructions and 2 store instructions, like "load X in R1; load Y in R2; store R1 in Y; store R2 in X".
Therefore, for swapping memory values the XOR trick has never been useful, in the entire history of automatic computers.
For swapping data that is stored in internal CPU vector registers or matrix registers there are special shuffle instructions, which implement various kinds of transpositions.
Switch in place was efficient mostly in the distant past, for swapping general-purpose registers in those CPUs that did not have dedicated exchange/swap instructions. Intel/AMD CPUs always had exchange instructions, so switch in place has never been useful on them in any circumstances, since the very launch of the IBM PC, 45 years ago.
Today, the XOR trick might have remained useful for swapping general-purpose registers in some microcontrollers, but in the most popular ISAs, like ARM-based or RISC-V, most GPRs are equivalent, so the need to swap them arises very rarely, only in certain kinds of loops, and even there swapping can frequently be avoided by unrolling the loops.