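Salsa20, ChaCha20, BLAKE, BLAKE2, and BLAKE3 all use a 4x4 grid of words, on which an ARX transformation is applied to four words at a time: along columns and then rows in Salsa20, and along columns and then diagonals in ChaCha20 and the BLAKE family.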
The state size is either 512 or 1024 bits, depending on the word size (32-bit or 64-bit).
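For concreteness, here is a minimal Python sketch of the ChaCha20 variant of this structure, assuming 32-bit words and the quarter-round as given in RFC 8439. The names `rotl32`, `quarter_round`, and `double_round` are illustrative only and do not come from any library.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit words, as in ChaCha20


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """Apply the ChaCha quarter-round in place to s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def double_round(s):
    """One column round followed by one diagonal round on a 16-word state."""
    # Column round: each quarter-round touches one column of the 4x4 grid.
    quarter_round(s, 0, 4, 8, 12)
    quarter_round(s, 1, 5, 9, 13)
    quarter_round(s, 2, 6, 10, 14)
    quarter_round(s, 3, 7, 11, 15)
    # Diagonal round: each quarter-round touches one diagonal of the grid.
    quarter_round(s, 0, 5, 10, 15)
    quarter_round(s, 1, 6, 11, 12)
    quarter_round(s, 2, 7, 8, 13)
    quarter_round(s, 3, 4, 9, 14)
```

With 16 words of 32 bits each, the state is 16 × 32 = 512 bits; each quarter-round only ever sees 128 of those bits, which is why the column and diagonal passes have to alternate.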
Alternating these operations between rows and columns seems inherently inefficient with respect to diffusion/confusion, as there are simply more bits to mix.
Why not take just the ARX operation on a single row of four words and base the entire cipher on that, which would give a 128-bit or 256-bit permutation? A key schedule would be needed so as not to halve the security, but that doesn't seem like a problem.
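To make the idea concrete, here is a rough Python sketch of such a 128-bit permutation, built only from the ChaCha quarter-round on four 32-bit words. Everything beyond the quarter-round itself is an assumption made for illustration: the names `qr`, `qr_inv`, `toy_permute`, and `toy_unpermute`, the round count of 8, and the round constants `r + 1`. The constants are there only because the bare quarter-round maps the all-zero input to itself. No keying is included and no security claim is intended.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit words, giving a 128-bit block


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def rotr32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return (x >> n) | ((x << (32 - n)) & MASK32)


def qr(a, b, c, d):
    """ChaCha quarter-round on four 32-bit words."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d


def qr_inv(a, b, c, d):
    """Inverse of qr: undo the four steps in reverse order."""
    b = rotr32(b, 7) ^ c;  c = (c - d) & MASK32
    d = rotr32(d, 8) ^ a;  a = (a - b) & MASK32
    b = rotr32(b, 12) ^ c; c = (c - d) & MASK32
    d = rotr32(d, 16) ^ a; a = (a - b) & MASK32
    return a, b, c, d


def toy_permute(block, rounds=8):
    """Hypothetical 128-bit permutation: iterate qr with a round constant."""
    a, b, c, d = block
    for r in range(rounds):
        a ^= r + 1  # hypothetical round constant, breaks the zero fixed point
        a, b, c, d = qr(a, b, c, d)
    return a, b, c, d


def toy_unpermute(block, rounds=8):
    """Inverse of toy_permute."""
    a, b, c, d = block
    for r in reversed(range(rounds)):
        a, b, c, d = qr_inv(a, b, c, d)
        a ^= r + 1
    return a, b, c, d
```

Every step is invertible, so the construction is a permutation of the 128-bit block for any round count; how many rounds it would need, and how it should be keyed, is exactly what the question is asking.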