Suppose I have a data table in the .rodata section, and in my function I want to use that table 3-4 times. I have two options:
Option 1 (smaller code size):
mov rax, MY_DATA_TABLE
vpbroadcastb zmm2, BYTE [rax+64]
vpbroadcastb zmm3, BYTE [rax+128]
vpbroadcastb zmm4, BYTE [rax+192]
Option 2 (larger code size, but I think better performance, i.e. lower latency):
vpbroadcastb zmm2, BYTE [MY_DATA_TABLE+64]
vpbroadcastb zmm3, BYTE [MY_DATA_TABLE+128]
vpbroadcastb zmm4, BYTE [MY_DATA_TABLE+192]
Which one is better in general? Please note that I'm not asking only about vpbroadcastb; I'm asking about the general case.
My data table holds 256 entries of 16 bytes each (packed bytes, for the values 0 to 255), not 256 entries of 64 bytes.
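
For reference, here is a minimal sketch of the table layout described above, in NASM syntax. It rests on assumptions that the question does not state: that entry i consists of sixteen copies of the byte value i, that the table is 64-byte aligned, and that the code is assembled as a 64-bit ELF object. The label `load_three` and the preprocessor variable `val` are hypothetical names used only for illustration. The two addressing forms are copied from the options above.

```nasm
; Sketch only. Assemble with: nasm -f elf64 table.asm
section .rodata
align 64                            ; assumed alignment, not stated in the question
MY_DATA_TABLE:
%assign val 0
%rep 256
    times 16 db val                 ; assumed fill: entry `val` is 16 copies of that byte
%assign val val+1
%endrep                             ; total size: 256 * 16 = 4096 bytes

section .text
global load_three                   ; hypothetical function name
load_three:
    ; Option 1 form: table address in a register, then base + displacement
    mov          rax, MY_DATA_TABLE
    vpbroadcastb zmm2, byte [rax + 4*16]             ; offset 64  -> entry 4
    vpbroadcastb zmm3, byte [rax + 8*16]             ; offset 128 -> entry 8

    ; Option 2 form: the symbol referenced directly in the memory operand
    vpbroadcastb zmm4, byte [MY_DATA_TABLE + 12*16]  ; offset 192 -> entry 12
    ret
```

Under this assumed layout, entry i starts at MY_DATA_TABLE + i*16, so the offsets 64, 128 and 192 used in both options select entries 4, 8 and 12.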