
I just started learning ARM assembly. I am currently on 32-bit Raspbian with "GNU assembler version 2.35.2 (arm-linux-gnueabihf)".

This is my simple program to load part of an ASCII string into a register:

.global _start
_start:
    ldr r1,=helloworld
    ldr r2,[r1]

    @prepare to exit
    mov r0,#0
    mov r7,#1
    svc 0

.data
helloworld:
    .ascii "HelloWorld"

I loaded it into gdb and can see that register r2 holds 0x6c6c6548 ("lleH" in ASCII). A quick objdump shows:

Contents of section .data:
 0000 48656c6c 6f576f72 6c64               HelloWorld

I have the following questions:

  1. What does the string look like in memory? In other words, when does endianness come into the picture? Is the string reversed when it is loaded into memory, or is it stored as is and reversed only when it is loaded into a register?
  2. Why does register r2 hold 0x12345678 instead of 0x78563412 in the program below, which uses .word? Why is endianness not applied there?

Note: .word is used instead of .ascii.

.global _start
_start:
    ldr r1,=helloworld
    ldr r2,[r1]
    mov r0,#0
    mov r7,#1
    svc 0

.data
helloworld:
    .word 0x12345678

EDIT

The memory dump for the first program shows that the string is stored in memory in the same order as in the source code and the object file:

>>> x/32xb 0x1008c
0x1008c:    0x48    0x65    0x6c    0x6c    0x6f    0x57    0x6f    0x72
0x10094:    0x6c    0x64    0x41    0x11    0x00    0x00    0x00    0x61

This indicates that the ldr instruction interprets the memory it reads as little endian, so the least significant byte of the register holds the first byte in memory. Is this understanding correct? It still does not explain why the same thing did not happen for .word.

Peter Cordes
Naveen
  • The bytes are never reversed. Reading the value of a register as if the most significant byte is "the first byte" makes it look as though the bytes are reversed, but that's just an effect of how you're reading them. – harold Dec 25 '21 at 13:47

2 Answers


Endianness, or byte order, is the order in which the bytes comprising a number are represented in memory.

A string is an array of bytes. Each byte of this string is subject to endianness, but for a single byte, little and big endian come out to the same thing.

For your second question: endianness only affects data while it is stored in memory. The assembler gives you a human-readable representation of the computer program. The token 0x12345678 represents a certain number. When transferred to memory, this number is written in the appropriate byte order. The assembler takes care of this.

You will also see the register content as 0x12345678 when watching your program execute in a debugger. This is because registers are not part of memory and are not divided into addressable bytes. Each register holds a 32-bit number. The CPU transfers data between registers and memory in the configured byte order (see the SETEND instruction). Without the register being divided into bytes, there is no meaningful way to assign a byte order to it. The debugger can only show you its numeric value, and that is just the value you assigned to it in your program. Crazy how this works, eh?
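To make the round trip concrete, here is a minimal Python sketch. It is not ARM code; it only models byte order using the standard `struct` module and `int.from_bytes`. The helper names `store_word_le` and `load_word_le` are made up for illustration and stand in for what the assembler and `ldr` do on a little-endian target.

```python
import struct

def store_word_le(value: int) -> bytes:
    """Model the assembler emitting a 32-bit .word on a little-endian target."""
    return struct.pack("<I", value)

def load_word_le(memory: bytes, offset: int = 0) -> int:
    """Model a little-endian 32-bit load: the byte at the lowest address
    becomes the least significant byte of the result."""
    return int.from_bytes(memory[offset:offset + 4], "little")

# .word 0x12345678: stored low byte first, loaded back as the same number.
word_bytes = store_word_le(0x12345678)
print(word_bytes.hex(" "))             # bytes in address order
print(hex(load_word_le(word_bytes)))   # value a register would hold

# .ascii "HelloWorld": stored as written, first four bytes loaded as one number.
ascii_bytes = b"HelloWorld"
print(hex(load_word_le(ascii_bytes)))
```

Under these assumptions the word comes back unchanged because it is stored and loaded with the same convention, while the string's first character lands in the least significant byte, which is why it appears last when the register is printed as a number.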

fuz
  • Thanks. I know that registers are not byte addressable (but can be byte accessible). I did a `hexdump` of my object file for the first program and can see `48 65 6C 6C 6F 57 6F 72 6C 64` (for `HelloWorld`), which is in the same order as the string. But for the object file of the second program, I see `78 56 34 12`. So it appears that the assembler already reverses the `.word`. The contents of the object file are then loaded as-is into memory. When I use an instruction like `ldr`, the CPU reads data from memory into registers in bytes, and hence reads the first byte into the LSB, then the next byte, and so on up to the MSB to make 4 bytes. – Naveen Dec 25 '21 at 11:40
  • This way, we indirectly load the number `0x12345678` in correct order as in the source file. Is this understanding of mine correct? – Naveen Dec 25 '21 at 11:40
  • @Naveen Yes, this understanding is correct. But I'm not sure what you mean by “registers can be byte accessible.” In the case of the ARM instruction set, I don't think they are. – fuz Dec 25 '21 at 11:57
  • Oh yes. I mean that for general cases e.g. x86 allows that (`ah`,`al`). – Naveen Dec 25 '21 at 11:58
  • @Naveen On x86 this is correct, but it still doesn't really indicate a byte order of registers. You just have a specific name for a specific part of the register, but because there are no instructions to index the register file, there isn't really a byte order. – fuz Dec 25 '21 at 12:01
  • One might argue that a byte order exists e.g. with SIMD registers where table lookup instructions like `TBL` exist, but the byte order is really isolated to these instructions and some architectures (like PowerPC) have table lookup instruction with both little and big endian byte order. – fuz Dec 25 '21 at 12:02
  • @Naveen: re: register endianness: it's more clear-cut for vectors, as fuz said: they're indexable at runtime, and also bit-shift within SIMD elements can shift bits across boundaries which connects them together into larger integers. See [How does endianness work with SIMD registers?](https://stackoverflow.com/a/62676466). You can bit-shift RAX and then access AL and AH, so it's clear that AL is the least-significant byte, but it doesn't have an address; the only way the order is exposed any other way is via the endianness of a word, dword or qword store of AX, EAX, or RAX. – Peter Cordes Dec 25 '21 at 13:36

.ascii is a string of bytes; .word is a list of 32-bit items, not 8-bit items, so the two are not directly comparable. Perhaps you wanted .byte?

.ascii "Hello"
.align
.word 0x12345678
.byte 0x12,0x34,0x56,0x78

Assemble and disassemble:

00000000 <.text>:
   0:   6c6c6548    cfstr64vs   mvdx6, [ip], #-288  ; 0xfffffee0
   4:   0000006f    andeq   r0, r0, pc, rrx
   8:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000
   c:   78563412    ldmdavc r6, {r1, r4, sl, ip, sp}^

Link, copy to binary and dump:

00000000  48 65 6c 6c 6f 00 00 00  78 56 34 12 12 34 56 78 |Hello...xV4..4Vx|
00000010

No surprises here; everything is as expected so far. The ASCII string is a string of bytes, and we see those in the order we declared them. The word is a single 32-bit item and this is a little-endian target: in 0x12345678, 0x78 is the least significant byte, so it goes first, at the lowest address. To compare against .ascii apples to apples we need a string of bytes, so 0x12 was declared first, just like 'H' was declared first, and we see it first in memory.
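The same 16 bytes can be rebuilt by hand. This is a rough Python sketch, not what the assembler actually runs: the `align` helper is hypothetical, and it assumes the bare .align pads with zeros to a 4-byte boundary, which is what the dump above suggests.

```python
import struct

def align(data: bytes, boundary: int = 4) -> bytes:
    """Pad with zero bytes up to the next multiple of `boundary`
    (assumed behaviour of the bare .align in this example)."""
    padding = (-len(data)) % boundary
    return data + b"\x00" * padding

image = b"Hello"                            # .ascii "Hello": bytes in source order
image = align(image)                        # .align
image += struct.pack("<I", 0x12345678)      # .word: one 32-bit item, low byte first
image += bytes([0x12, 0x34, 0x56, 0x78])    # .byte: four 8-bit items in source order

print(image.hex(" "))
```

Only the .word line involves a byte-order decision; the .ascii and .byte lines are emitted one byte at a time, so there is nothing to reorder.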

ldr r0,label0
ldr r1,label1

.ascii "Hello"
.align
label0:
.word 0x12345678
label1:
.byte 0x12,0x34,0x56,0x78

Assemble and disassemble:

00000000 <label0-0x10>:
   0:   e59f0008    ldr r0, [pc, #8]    ; 10 <label0>
   4:   e59f1008    ldr r1, [pc, #8]    ; 14 <label1>
   8:   6c6c6548    cfstr64vs   mvdx6, [ip], #-288  ; 0xfffffee0
   c:   0000006f    andeq   r0, r0, pc, rrx

00000010 <label0>:
  10:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000

00000014 <label1>:
  14:   78563412    ldmdavc r6, {r1, r4, sl, ip, sp}^

Again, no surprise. The DISASSEMBLER has tried to turn these bytes into instructions and has shown them as words, so we see 0x12345678 and 0x78563412 respectively, and those are the values that would land in r0 and r1.

Link and copy to binary and hexdump -C

00000000  08 00 9f e5 08 10 9f e5  48 65 6c 6c 6f 00 00 00  |........Hello...|
00000010  78 56 34 12 12 34 56 78                           |xV4..4Vx|
0

We did not change any of the data, so the bytes of the data items are the same as before.
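If you want to check the relationship between the byte dump and the word view yourself, a small Python sketch like the following reads the 24 bytes from the hexdump as little-endian 32-bit words. It only uses the standard `struct` module; the variable names are arbitrary.

```python
import struct

# The 24 bytes from the hexdump -C output, in address order.
dump = bytes.fromhex(
    "08 00 9f e5 08 10 9f e5 48 65 6c 6c 6f 00 00 00 "
    "78 56 34 12 12 34 56 78"
)

# Interpret the image as six little-endian 32-bit words, the way the
# disassembler (and an ldr on this target) would.
words = struct.unpack("<6I", dump)

for offset, word in zip(range(0, len(dump), 4), words):
    print(f"{offset:4x}:   {word:08x}")
```

The words at offsets 0x10 and 0x14 are the ones the two ldr instructions fetch, and the listing lines up with the word column of the disassembly.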

old_timer