
I need to put a 32-bit absolute address (e.g. an MMIO address, not PC-relative) into a register on AArch64.

On ARM32 it was possible to use :lower16: and :upper16: to load an address into a register:

movw    r0, #:lower16:my_addr
movt    r0, #:upper16:my_addr

Is there a way to do a similar thing on AArch64 by using movk?

If the code is relocated, I still want the same absolute address, so adr is not suitable.

ldr from a nearby literal pool would work, but I'd rather avoid that.

Peter Cordes
user3124812
  • Relative memory reading via LDR and ADR is relocatable code. On the other hand your ARM32 example code isn't relocatable. Also note that `:lower16:` and `:upper16:` wouldn't be sufficient for 64-bit ARM code because addresses are 64-bit. – Ross Ridge Mar 12 '19 at 01:44
  • Noup, ldr & adr are not relocateable in my case since memory region they are referencing could not be copied into a new location. – user3124812 Mar 12 '19 at 01:48
  • LDR and ADR are PC relative so work even if the program is relocated. – Ross Ridge Mar 12 '19 at 01:51
  • alright, mate. I need to load an absolute address without usage of `LDR` & `ADR` instructions. – user3124812 Mar 12 '19 at 03:00
  • Is your absolute address in the low 32 bits of address space? – Peter Cordes Mar 12 '19 at 04:33
  • @PeterCordes, yes, it fits into 32b. – user3124812 Mar 12 '19 at 04:45
  • @PeterCordes You're making assumptions about the poster's intent. – Ross Ridge Mar 12 '19 at 04:46
  • @RossRidge: yes, I am. It seemed most useful to turn this into an answerable question. Hopefully they'll let us know if any of the assumptions were wrong. I think my guess is the best fit for the clues we have in comments. (But IDK why they think an `LDR` wouldn't work; a literal pool should get relocated unmodified along with the code. So hopefully we'll get an edit to the question from the OP to clarify that.) – Peter Cordes Mar 12 '19 at 04:49
  • There is a 'trampoline' to jump into some address. Trampoline could be copied and executed from different locations, so it could only use an absolute destination address. Literal pool would work if copied together with respected executable code, that's right. There is some limitation on this my side though. I better to avoid an extra caring about a pool and copying it. That's why `movw/movt`-like approach is so appealing. – user3124812 Mar 12 '19 at 06:06
  • What prevents you from using a sequence of `LDR`, `B`, `32-bit address`? Wouldn't you be able to place it anywhere (perhaps, adjusting for alignment) and execute without any problem? – Alexey Frunze Mar 12 '19 at 07:43
  • Just to be clear, `LDR` is not PC relative, it is **register** relative. That register might be the PC (for values in the literal pool), it might be the SP (for local variables in a function), or it could be some other general purpose register. – Elliot Alderson Mar 12 '19 at 11:29
  • @ElliotAlderson: pretty sure we're talking about whatever instructions the assembler chooses to use for a `ldr w0, =0x12345678` pseudo-instruction. Which could be `mov`/`movk`. – Peter Cordes Mar 12 '19 at 11:37
  • @PeterCordes Then the comments should be that the LDR **pseudo-instruction** is PC relative. The actual LDR **instruction** is register relative. For newcomers, the distinction matters. – Elliot Alderson Mar 12 '19 at 11:42
  • @ElliotAlderson: but the value loaded isn't PC-relative, only the place it's loaded from. And like I said, an assembler could choose to expand it to `mov`/`movk` instead of a load at all. Or for simple constants, just a `mov`. – Peter Cordes Mar 12 '19 at 11:47
  • @ElliotAlderson: Yes, I agree it's an important point that `LDR reg, =const` *can* be expanded to a PC-relative load; the whole concept of nearby literal pools is very important. I was trying to focus on what was relevant for this question. (If the constant is known at assemble-time, not just link time, then `mov` + `movk` are easy to use and probably best.) – Peter Cordes Mar 12 '19 at 12:49
  • @ElliotAlderson Your comments aren't helpful. My first two comments were directed at the original poster and made in the context of the question asked with the goal of trying to clarify the question. They weren't meant to try to educate people learning assembly language and should not have been interpreted that way. – Ross Ridge Mar 12 '19 at 17:29
  • @RossRidge If you only wanted to help the OP then perhaps you should have contacted them directly, As it is, this site is designed to help many people, with a range of experience levels, beyond the OP so it is important that the content is not misleading. Your statement that LDR is PC relative is, at best misleading and at worst simply incorrect. I believe my comments will be helpful to some readers who come along later, and they took nothing away from helping the OP. – Elliot Alderson Mar 12 '19 at 18:03
  • @ElliotAlderson Sorry, but you misunderstand the purpose of comments. They are in fact a means of contacting the author of post directly, and are not meant to educate users, only to address issues with the post. https://meta.stackexchange.com/questions/19756/how-do-comments-work – Ross Ridge Mar 12 '19 at 18:12
  • @RossRidge I think I understand the purpose of comments quite well. From your link: "Comments exist so that users can **talk about** questions and answers without posting new answers that do not make an attempt to answer the question asked." If you had put the incorrect statements in an answer I would have commented on your answer, but since you put them in a comment this was my only method for correcting your comment. Are you relying on technicalities now, rather than calling my comments "not helpful"? – Elliot Alderson Mar 12 '19 at 19:07
  • @ElliotAlderson Your comments are useless pedantry. They weren't meant to inform me of anything I didn't already know. You justified your comments as being for the benefit of people learning assembly language, you can't now argue that they were for my benefit. You're the one trying to rely on technicalities. – Ross Ridge Mar 12 '19 at 19:20
  • @RossRidge Sorry, I was trying to explain why I commented in the way I did. I wasn't trying to inform you or to write for your benefit, just to suggest that you make your writing more clear for new users. You can call it pedantics; I call it clarifying a point of possible confusion. I still believe that my comments added helpful information to the discussion. I wasn't intending to hurt your feelings, and I apologize if I did. – Elliot Alderson Mar 12 '19 at 19:30
  • OMG, never expect such a hot discussion over a question. Thanks everyone, answer is found and I believe that could be useful not only for me. :) – user3124812 Mar 13 '19 at 01:14

2 Answers


If your address is an assemble-time constant, not link-time, this is super easy. It's just an integer, and you can split it up manually.

I asked gcc and clang to compile unsigned abs_addr() { return 0x12345678; } (Godbolt)

// gcc8.2 -O3
abs_addr():
    mov     w0, 0x5678               // low half
    movk    w0, 0x1234, lsl 16       // high half
    ret

(Writing w0 implicitly zero-extends into 64-bit x0, same as x86-64).


Or if your constant is only a link-time constant and you need to generate relocations in the .o for the linker to fill in, the GAS manual documents what you can do, in the AArch64 machine-specific section:

Relocations for ‘MOVZ’ and ‘MOVK’ instructions can be generated by prefixing the label with #:abs_g2: etc. For example to load the 48-bit absolute address of foo into x0:

    movz x0, #:abs_g2:foo     // bits 32-47, overflow check
    movk x0, #:abs_g1_nc:foo  // bits 16-31, no overflow check
    movk x0, #:abs_g0_nc:foo  // bits  0-15, no overflow check

The GAS manual's example is sub-optimal; going low to high is more efficient on at least some AArch64 CPUs (see below). For a 32-bit constant, follow the same pattern that gcc used for a numeric literal.

 movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
 movk x0, #:abs_g1:foo              // bits 16-31, overflow check

#:abs_g1:foo is known to have its possibly-set bits in the 16-31 range, so the assembler knows to use a lsl 16 when encoding movk. You should not use an explicit lsl 16 here.
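To make the g0/g1 split concrete, here is a minimal C sketch of what the linker ends up putting into each 16-bit immediate, and of the range check that the non-`_nc` forms imply. The function names `abs_chunk` and `abs_chunk_fits` are made up for illustration, and the check is modelled on my understanding that the checked G0/G1/G2 relocations require the whole value to fit in 16/32/48 bits; treat it as a model of the behaviour, not as linker code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: 16-bit chunk n (0..3) of an absolute address,
 * i.e. the immediate that a #:abs_gN: / #:abs_gN_nc: operand resolves to
 * once the symbol's address is known. */
uint16_t abs_chunk(uint64_t addr, unsigned n)
{
    assert(n < 4);
    return (uint16_t)(addr >> (16 * n));
}

/* Hypothetical helper: models the overflow check of the checked forms
 * (#:abs_g0:, #:abs_g1:, #:abs_g2:) as "no bits set above chunk n".
 * The _nc forms skip this check entirely. */
bool abs_chunk_fits(uint64_t addr, unsigned n)
{
    assert(n < 4);
    if (n == 3)
        return true;                      /* nothing above bits 48-63 */
    return (addr >> (16 * (n + 1))) == 0;
}
```

With the address 0x410114 that `foo` gets in the objdump output further down, `abs_chunk` gives 0x114 for g0 and 0x41 for g1. `abs_chunk_fits(0x410114, 0)` is false, which is why a checked g0 on the low half fails at link time for anything above 64 KiB, while the checked g1 passes for any 32-bit address.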

I chose x0 instead of w0 because that's what gcc does for unsigned long long. Probably performance is identical on all CPUs, and code size is identical.

.text
func:
   // efficient
     movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
     movk x0, #:abs_g1:foo              // bits 16-31, overflow check

   // inefficient but does assemble + link
   //  movz x1, #:abs_g1:foo              // bits 16-31, overflow check
   //  movk x1, #:abs_g0_nc:foo           // bits  0-15, no overflow check

.data
foo: .word 123       // .data will be in a different page than .text

With GCC: aarch64-linux-gnu-gcc -nostdlib aarch-reloc.s to build and link (just to prove we can; this would just crash if you actually ran it), and then aarch64-linux-gnu-objdump -drwC a.out:

a.out:     file format elf64-littleaarch64


Disassembly of section .text:

000000000040010c <func>:
  40010c:       d2802280        mov     x0, #0x114                      // #276
  400110:       f2a00820        movk    x0, #0x41, lsl #16
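The two instruction words in that disassembly can be cross-checked by hand. Below is a small C sketch that packs the 64-bit MOVZ and MOVK encodings (base opcode, 2-bit `hw` shift selector at bit 21, 16-bit immediate at bit 5, destination register in the low 5 bits). The function names are my own; the field layout is what I believe the A64 "move wide immediate" format to be, and it does reproduce both words above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode MOVZ Xd, #imm16, LSL #(16*hw). */
uint32_t encode_movz_x(unsigned rd, uint16_t imm16, unsigned hw)
{
    assert(rd < 32 && hw < 4);
    return UINT32_C(0xD2800000) | ((uint32_t)hw << 21)
         | ((uint32_t)imm16 << 5) | (uint32_t)rd;
}

/* Hypothetical helper: encode MOVK Xd, #imm16, LSL #(16*hw). */
uint32_t encode_movk_x(unsigned rd, uint16_t imm16, unsigned hw)
{
    assert(rd < 32 && hw < 4);
    return UINT32_C(0xF2800000) | ((uint32_t)hw << 21)
         | ((uint32_t)imm16 << 5) | (uint32_t)rd;
}
```

`encode_movz_x(0, 0x114, 0)` gives 0xd2802280 and `encode_movk_x(0, 0x41, 1)` gives 0xf2a00820, matching the linked output. objdump prints the first one as `mov` because that is the preferred disassembly alias for a plain movz.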

Clang appears to have a bug here, making it unusable: it only assembles #:abs_g1_nc:foo (no check for the high half) and #:abs_g0:foo (overflow check for the low half). This is backwards, and results in a linker error (g0 overflow) when foo has a 32-bit address. I'm using clang version 7.0.1 on x86-64 Arch Linux.

$ clang -target aarch64 -c aarch-reloc.s
aarch-reloc.s:5:15: error: immediate must be an integer in range [0, 65535].
     movz x0, #:abs_g0_nc:foo
              ^

As a workaround, g1_nc instead of g1 is fine; you can live without overflow checks. But you need g0_nc, unless you have a linker where checking can be disabled. (Or maybe some clang installs come with a linker that's bug-compatible with the relocations clang emits?) I was testing with GNU ld (GNU Binutils) 2.31.1 and GNU gold (GNU Binutils 2.31.1) 1.16.

$ aarch64-linux-gnu-ld.bfd aarch-reloc.o 
aarch64-linux-gnu-ld.bfd: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
aarch64-linux-gnu-ld.bfd: aarch-reloc.o: in function `func':
(.text+0x0): relocation truncated to fit: R_AARCH64_MOVW_UABS_G0 against `.data'

$ aarch64-linux-gnu-ld.gold aarch-reloc.o 
aarch-reloc.o(.text+0x0): error: relocation overflow in R_AARCH64_MOVW_UABS_G0

MOVZ vs. MOVK vs. MOVN

movz = move-zero puts a 16-bit immediate into a register with a left-shift of 0, 16, 32 or 48 (and clears the rest of the bits). You always want to start a sequence like this with a movz, and then movk the rest of the bits. (movk = move-keep. Move 16-bit immediate into register, keeping other bits unchanged.)

mov is sort of a pseudo-instruction that can pick movz, but I just tested with GNU binutils and clang, and you need an explicit movz (not mov) with an immediate like #:abs_g0:foo. Apparently the assembler won't infer that it needs movz there, unlike with a numeric literal.

For a narrow immediate, e.g. 0xFF000 which has non-zero bits in two aligned 16-bit chunks of the value, mov w0, #0xFF000 would pick the bitmask-immediate form of mov, which is actually an alias for ORR-immediate with the zero register. AArch64 bitmask-immediates use a powerful encoding scheme for repeated patterns of bit-ranges. (So e.g. and x0, x1, 0x5555555555555555 (keep only the even bits) can be encoded in a single 32-bit-wide instruction, great for bit-hacks.)
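If you want to know whether a given 64-bit constant qualifies as a bitmask immediate, a brute-force test is easy to write. The sketch below uses the rule as I understand it: the value must be an element of 2, 4, 8, 16, 32 or 64 bits repeated across the register, and that element must be a rotated contiguous run of ones (not all-zero, not all-one). `is_bitmask_imm64` and `rotr_width` are names I made up, and this only covers x-register operands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate the low `width` bits of x right by r bits. */
static uint64_t rotr_width(uint64_t x, unsigned r, unsigned width)
{
    assert(width >= 2 && width <= 64 && r < width);
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << width) - 1);
    x &= mask;
    if (r == 0)
        return x;
    return ((x >> r) | (x << (width - r))) & mask;
}

/* Hypothetical helper: true if `value` looks encodable as a 64-bit
 * logical (bitmask) immediate under the rule described above. */
bool is_bitmask_imm64(uint64_t value)
{
    for (unsigned size = 2; size <= 64; size *= 2) {
        uint64_t mask = (size == 64) ? ~UINT64_C(0)
                                     : ((UINT64_C(1) << size) - 1);
        uint64_t elem = value & mask;

        /* The whole register must be this element repeated. */
        bool repeats = true;
        for (unsigned pos = size; pos < 64; pos += size) {
            if (((value >> pos) & mask) != elem) {
                repeats = false;
                break;
            }
        }
        if (!repeats || elem == 0 || elem == mask)
            continue;

        /* Some rotation of the element must look like 0...01...1. */
        for (unsigned r = 0; r < size; r++) {
            uint64_t x = rotr_width(elem, r, size);
            if ((x & (x + 1)) == 0)
                return true;
        }
    }
    return false;
}
```

Under this test 0xFF000 and 0x5555555555555555 both qualify, while 0x12345678, 0 and all-ones do not, so an arbitrary address generally still needs the movz/movk pair.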

There's also movn (move not) which flips the bits. This is useful for negative values, allowing you to have all the upper bits set to 1. There's even a relocation for it, according to AArch64 relocation prefixes.
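As a rough model of what the three instructions do to a 64-bit register, here is a C sketch. The function names are mine, and the shift is passed as a bit count (0, 16, 32 or 48) to mirror the `lsl #n` operand.

```c
#include <assert.h>
#include <stdint.h>

/* movz: place imm16 at `shift`, zero every other bit. */
uint64_t movz64(uint16_t imm16, unsigned shift)
{
    assert(shift % 16 == 0 && shift <= 48);
    return (uint64_t)imm16 << shift;
}

/* movk: replace only the 16-bit field at `shift`, keep the rest of `old`. */
uint64_t movk64(uint64_t old, uint16_t imm16, unsigned shift)
{
    assert(shift % 16 == 0 && shift <= 48);
    uint64_t field = UINT64_C(0xFFFF) << shift;
    return (old & ~field) | ((uint64_t)imm16 << shift);
}

/* movn: bitwise NOT of what movz would have produced. */
uint64_t movn64(uint16_t imm16, unsigned shift)
{
    return ~movz64(imm16, shift);
}
```

So `movk64(movz64(0x5678, 0), 0x1234, 16)` rebuilds 0x12345678, and `movn64(1, 0)` is 0xFFFFFFFFFFFFFFFE, i.e. -2 in a single instruction.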


Performance: movz low16; movk high16 in that order

The Cortex-A57 optimization manual says:

4.14 Fast literal generation

Cortex-A57 r1p0 and later revisions support optimized literal generation for 32- and 64-bit code

    MOV wX, #bottom_16_bits
    MOVK wX, #top_16_bits, lsl #16

[and other examples]

... If any of these sequences appear sequentially and in the described order in program code, the two instructions can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.

The sequences include movz low16 + movk high16 into x or w registers, in that order. (And also back-to-back movk to set the high 32, again in low, high order.) According to the manual, both instructions have to use w, or both have to use x registers.

Without special support, the movk would have to wait for the movz result to be ready as an input for an ALU operation to replace that 16-bit chunk. Presumably at some point in the pipeline, the 2 instructions merge into a single 32-bit immediate movz or movk, removing the dependency chain.

Peter Cordes
  • `abs_g*` is what I was looking for. Thanks! – user3124812 Mar 13 '19 at 01:04
  • Just to note, instead of `movz x0, #:abs_g2:foo`, I used `mov x0, #0` + `movk x0, #:abs_g2_nc:foo`. For some reason linker could not make first version and return "unrecognized reloc 267". – user3124812 Mar 13 '19 at 01:11
  • @user3124812: I don't think you need to zero the register first and then merge with `movk`. `mov x0, #:abs_g2_nc:foo` should be able to put a 16-bit immediate left-shifted to any position in a register (and zero the rest of the bits). BTW, isn't `g2` for bits `32-47`? That would make your address 4GB aligned, and 48 bits, not 32. – Peter Cordes Mar 13 '19 at 01:55
  • `mov x0, #:abs_g2_nc:foo` is not compiled with a Clang, GCC, on the other hand, handles that properly. I checked that on godbolt.org as well. In fact address fits 32b but it's not a big deal to have one extra 'explicit' instruction after all and do not worry about that in future – user3124812 Mar 13 '19 at 02:52
  • @user3124812: I was just experimenting with that, it seems you need an explicit `movz`, not just `mov`. Working on an update. – Peter Cordes Mar 13 '19 at 02:53
  • I searched about that `abs_g` and found that `movz` is going together with `abs_g3`. After all: `movz x0,#:abs_g3:foo` + `movk x0,#:abs_g2_nc:foo` + etc with `g1` & `g0`. Credits for https://stackoverflow.com/questions/38570495/aarch64-relocation-prefixes – user3124812 Mar 13 '19 at 02:56
  • @user3124812: yeah, you're right, clang appears to be broken. You can use `g1` or `g1_nc`, and just do without error checking for the high half. But you *need* `g0_nc` for the low half (unless your address is 64k-aligned), and I wasn't able to get clang to assemble that. `g0` is unusable because it will always error at link time. – Peter Cordes Mar 13 '19 at 05:08

Assuming that Peter Cordes' edits to your post reflect your actual intent, you can use the MOVL pseudo-instruction to load an absolute address into a register without using the LDR instruction. For example:

    MOVL x0, my_addr

The MOVL pseudo-instruction has the advantage of working both with externally defined symbols and locally defined constants. It will expand to two or four instructions, depending on whether the destination is a 32-bit or 64-bit register, usually a MOV instruction followed by one or three MOVK instructions.

However, it's not obvious why the LDR instruction, specifically the LDR pseudo-instruction, wouldn't also work. This normally results in a PC-relative load from a literal pool that the assembler will place in the same section (area) as your code.

For example:

    LDR x0, =my_addr

would be assembled to something like:

    LDR x0, literal_pool   ; LDR (PC-relative literal)
    ; ...
literal_pool:
    .quad my_addr

Since literal_pool is part of the same code section as the PC-relative LDR instruction that references it, the offset between the instruction and the symbol never changes, making the code relocatable. You can place your trampoline code in its own section and/or use the LTORG directive to ensure that the literal pool gets placed in a close and easily predictable location.

Ross Ridge