22

Here is a piece of assembly code:

section .text
    global _start           ;entry point must be visible to the linker
_start:
    mov edx, len            ;message length
    mov ecx, msg            ;message to write
    mov ebx, 1              ;file descriptor (stdout)
    mov eax, 4              ;system call number (sys_write)
    int 0x80                ;call kernel
    mov ebx, 0              ;exit status 0
    mov eax, 1              ;system call number (sys_exit)
    int 0x80                ;call kernel

section .data

msg db  'Hello, world!', 0xa    ;our dear string
len equ $ - msg                 ;length of our dear string

Given a specific computer system, is it possible to predict precisely the actual run time of a piece of assembly code?

JJJohn
  • 588
  • 4
  • 11

7 Answers

48

I can only quote from the manual of a rather primitive CPU, a 68020 processor from around 1986: "Calculating the exact runtime of a sequence of instructions is difficult, even if you have precise knowledge of the processor implementation". Which we don't have. And compared to a modern processor, that CPU was primitive.

I can't predict the runtime of that code, and neither can you. You can't even define what the "runtime" of a piece of code is when a processor has massive caches and massive out-of-order capabilities. A typical modern processor can have 200 instructions "in flight", that is, in various stages of execution. So the time from trying to read the first instruction byte to retiring the last instruction can be quite long. But the actual delay to all the other work the processor needs to do may be (and typically is) a lot less.

Of course doing two calls to the operating system makes this completely unpredictable. You don't know what "writing to stdout" actually does, so you can't predict the time.

And you can't know the clock speed of the computer at the precise moment you run the code. It may be in some power-saving mode, or it may have reduced its clock speed because it got hot, so even the same number of clock cycles can take different amounts of time.
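
You can see this for yourself. The sketch below is my own illustration (assuming a 32-bit Linux environment with the same int 0x80 interface the question uses): it brackets the very same write call with rdtsc, and the measured cycle count comes out different on almost every run.

section .text
    global _start
_start:
    rdtsc                   ;EDX:EAX = time-stamp counter; the low half is enough here
    mov [start], eax        ;remember the starting count

    mov edx, len            ;the same sys_write as in the question
    mov ecx, msg
    mov ebx, 1
    mov eax, 4
    int 0x80

    rdtsc                   ;read the counter again
    sub eax, [start]        ;EAX = elapsed reference cycles for the write
                            ;(serious measurement code would also put a
                            ; serializing instruction around rdtsc)

    mov eax, 1              ;sys_exit
    xor ebx, ebx
    int 0x80

section .data
msg     db 'Hello, world!', 0xa
len     equ $ - msg

section .bss
start   resd 1

Inspect EAX under a debugger (or convert it to text and write it out) and the number jumps around between runs as caches warm up, interrupts land, and the clock frequency shifts.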

All in all: Totally unpredictable.

gnasher729
  • 32,238
  • 36
  • 56
30

You cannot do this in general, but in some senses, you very much can, and there have been a few historical cases in which you indeed had to.

The Atari 2600 (or Atari Video Computer System) was one of the earliest home video game systems and was first released in 1977. Unlike later systems of the era, Atari could not afford to give the device a frame buffer, meaning that the CPU had to run code on every scanline to determine what to produce - if this code took more than about 19 microseconds to run (the HBlank interval), the graphics would not be properly set before the scanline began drawing them. Worse, if the programmer wanted to draw more complex content than what the Atari normally allowed, they had to measure exact times for instructions and change the graphics registers as the beam was being drawn, within a span of roughly 63.6 microseconds for the whole scanline.

However, the Atari 2600, like many other systems based on the 6502, had a very important feature that enabled the careful time management this scenario required: the CPU, the RAM, and the TV signal all ran off clocks derived from the same master clock. The TV signal ran off a 3.58 MHz clock, portioning the times above into an integer number of "color clocks" that managed the TV signal, and a cycle of the CPU and RAM clocks was exactly three color clocks, allowing the CPU's clock to be an accurate measure of time relative to the current progress of the TV signal. (For more information on this, check out the Stella Programmer's Guide - "Stella" was the console's internal code name, later reused by the Stella emulator.)

This operating environment also meant that every CPU instruction took a defined number of cycles in every case, and many 6502 developers published this information in reference tables. For instance, consider this entry for the CMP (Compare Memory with Accumulator) instruction, taken from this table:

CMP  Compare Memory with Accumulator

     A - M                          N Z C I D V
                                    + + + - - -

     addressing    assembler    opc  bytes  cycles
     --------------------------------------------
     immediate     CMP #oper     C9    2     2
     zeropage      CMP oper      C5    2     3
     zeropage,X    CMP oper,X    D5    2     4
     absolute      CMP oper      CD    3     4
     absolute,X    CMP oper,X    DD    3     4*
     absolute,Y    CMP oper,Y    D9    3     4*
     (indirect,X)  CMP (oper,X)  C1    2     6
     (indirect),Y  CMP (oper),Y  D1    2     5*

*  add 1 to cycles if page boundary is crossed

Using all of this information, Atari 2600 developers (and other 6502 developers) were able to determine exactly how long their code was taking to execute, and to construct routines that did what they needed while still complying with the Atari's TV signal timing requirements. And because this timing was so exact (especially for time-wasting instructions like NOP), they were even able to use it to modify the graphics as they were being drawn.
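
As an illustration (my own sketch, not taken from any actual Atari routine), here is the kind of cycle-counted wait loop this made possible; the per-instruction costs come straight from tables like the one above:

        LDX #$0A        ; 2 cycles
Wait:   DEX             ; 2 cycles each pass
        BNE Wait        ; 3 cycles when taken, 2 on the final pass (same page)
                        ; total so far: 2 + 10*2 + 9*3 + 2 = 51 cycles, always
        STA WSYNC       ; 3 cycles; writing WSYNC then halts the CPU
                        ; until the next horizontal blank (see EDIT 1 below)

At 1.19 MHz those 51 cycles come to about 43 microseconds, and the programmer knows that before the code ever runs.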


Of course, the Atari's 6502 is a very specific case, and all of this is possible only because the system had all of the following:

  • A master clock that ran everything, including RAM. Modern systems have independent clocks for the CPU and the RAM, with the RAM clock often being slower and the two not necessarily being in sync.
  • No caching of any kind - the 6502 always accessed DRAM directly. Modern systems have SRAM caches that make it more difficult to predict the state - while it is perhaps still possible to predict the behavior of a system with a cache, it is definitely more difficult.
  • No other programs running simultaneously - the program on the cartridge had complete control of the system. Modern systems run multiple programs at once using non-deterministic scheduling algorithms.
  • A clock speed slow enough that signals could travel across the system in time. On a modern system with clock speeds of 4 GHz (for example), it takes a photon of light 6.67 clock cycles to travel the length of a half-meter motherboard - you could never expect a modern processor to interact with something else on the board in just one cycle, since it takes more than one cycle for a signal on the board to even reach the device. (The arithmetic is spelled out just after this list.)
  • A well defined clock speed that rarely changes (1.19 MHz in the case of the Atari) - the CPU speeds of modern systems change all the time, while an Atari could not do this without also affecting the TV signal.
  • Published cycle timings - the x86 does not define how long any of its instructions take.
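
For the curious, the 6.67-cycle figure in the clock-speed point above works out as follows:

    0.5 m / (3 x 10^8 m/s) ≈ 1.67 ns to cross the board at light speed
    1 cycle at 4 GHz = 0.25 ns
    1.67 ns / 0.25 ns ≈ 6.67 cycles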

All of these things came together to create a system where it was possible to craft sets of instructions that took an exact amount of time - and for this application, that's exactly what was demanded. Most systems do not have this degree of precision simply because there is no need for it - calculations either get done when they get done, or if an exact amount of time is needed, an independent clock can be queried. But if the need is right (such as on some embedded systems), it can still appear, and you will be able to accurately determine how long your code takes to run in these environments.


And I should also add the big massive disclaimer that all of this only applies to constructing a set of assembly instructions that will take an exact amount of time. If what you want to do is take some arbitrary piece of assembly, even in these environments, and ask "How long does this take to execute?", you categorically cannot do that in general - answering it would require solving the Halting Problem, which has been proven unsolvable.


EDIT 1: In a previous version of this answer, I stated that the Atari 2600 had no way of informing the processor of where it was in the TV signal, which forced it to keep the entire program counted and synchronized from the very beginning. As pointed out to me in the comments, this is true of some systems like the ZX Spectrum, but is not true of the Atari 2600, since it contains a hardware register that halts the CPU until the next horizontal blanking interval occurs, as well as a function to begin the vertical blanking interval at will. Hence, the problem of counting cycles is limited to each scanline, and only becomes exact if the developer wishes to change content as the scanline is being drawn.

TheHans255
  • 403
  • 3
  • 6
15

There are two aspects at play here

As @gnasher729 points out, even if we know the exact instructions to execute, it's still difficult to estimate the exact runtime because of things like caching, branch prediction, frequency scaling, etc.

However, the situation is even worse. Given a chunk of assembly, it's impossible in general to know which instructions will run, or even how many instructions will run. If we could determine that precisely, we could use that information to decide whether the program halts - in other words, we could solve the Halting Problem, which is impossible.

Assembly code can contain jumps and branches, which are enough to make the full trace of a program possibly infinite. There has been work on conservative approximations of execution time that give upper bounds on execution, through things like cost semantics or annotated type systems. I'm not familiar with anything for assembly specifically, but I wouldn't be surprised if something like that existed.
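
To make that concrete, here is a sketch in the same NASM style as the question (the routine and its register convention are made up for the illustration): even this tiny fragment has no fixed instruction count, because the number of passes through the loop is whatever value happens to be in ECX at run time - and that value could come from input we know nothing about.

count_down:                 ;hypothetical routine; ECX holds a caller-supplied value
    test ecx, ecx           ;already zero? then the loop body never runs
    jz   .done
.loop:
    dec  ecx                ;executes ECX times - could be 1 pass, could be billions
    jnz  .loop
.done:
    ret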

Joey Eremondi
  • 30,277
  • 5
  • 67
  • 122
2

Back in the era of 8-bit computers, some games did something like that. Programmers would use the exact, documented time each instruction took, together with the known clock speed of the CPU, to synchronize with the exact timings of the video and audio hardware. Back in those days, the display was a cathode-ray-tube monitor that would cycle through each line of the screen at a fixed rate and paint that row of pixels by turning the cathode ray on and off to activate or deactivate the phosphors. Because programmers needed to tell the video hardware what to display right before the beam reached that part of the screen, and fit the rest of their code into whatever time was left over, they called this "racing the beam."

It absolutely would not work on any modern computer, or for code like your example.

Why not? Here are some things that would mess up the simple, predictable timing:

CPU speed and memory fetches are both bottlenecks on execution time. It’s a waste of money to run a CPU faster than it can fetch instructions to execute, or to install memory that can deliver bytes faster than the CPU can accept them. Because of this, old computers ran both off the same clock. Modern CPUs run much faster than main memory. They manage that by having instruction and data caches. The CPU will still stall if it ever needs to wait for bytes that aren’t in the cache. The same instructions will therefore run much faster if they’re already in the cache than if they’re not.

Furthermore, modern CPUs have long pipelines. They keep up their high throughput by having another part of the chip do preliminary work on the next several instructions in the pipeline. This fails if the CPU doesn't know what the next instruction will be, which can happen if there's a branch. Therefore, CPUs try to predict conditional jumps. (You don't have any in this code snippet, but perhaps there was a mispredicted conditional jump to it that clogged the pipeline. Besides, it's a good excuse to link that legendary answer.) Similarly, programs that execute int 0x80 to trap into kernel mode are actually using a complicated CPU feature, an interrupt gate, which introduces unpredictable delay.

If your OS uses preemptive multitasking, the thread running this code could lose its timeslice at any moment.

Racing the beam also only worked because the program was running on the bare metal and banged directly on the hardware. Here, you're executing int 0x80 to make a system call. That passes control to the operating system, which gives you no timing guarantee. You then tell it to do I/O on an arbitrary stream, which might have been redirected to any device. It's much too abstract for you to say how much time the I/O takes, but it will surely dominate the time spent executing instructions.

If you want exact timing on a modern system, you need to introduce a delay loop: you have to make the faster iterations run at the speed of the slowest, since the reverse is not possible. One reason people do this in the real world is to prevent leaking cryptographic information to an attacker who can time which requests take longer than others.
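
A minimal sketch of that idea (my own example, in the same 32-bit NASM style as the question; the labels compare16, buf_a and buf_b are made up for the illustration): compare two 16-byte buffers without an early exit, so a mismatch in the first byte costs exactly as much time as a mismatch in the last, and an attacker learns nothing from the timing.

compare16:                  ;hypothetical routine comparing 16 bytes at buf_a/buf_b
    xor eax, eax            ;accumulated difference
    xor ecx, ecx            ;byte index
.next:
    mov dl, [buf_a + ecx]   ;fetch a byte from each buffer
    xor dl, [buf_b + ecx]   ;zero if the bytes match
    or  al, dl              ;fold any mismatch into AL - never branch on the data
    inc ecx
    cmp ecx, 16
    jne .next               ;the loop always runs all 16 times
    ret                     ;AL == 0 exactly when the buffers are equal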

Davislor
  • 1,241
  • 9
  • 12
2

Would the choice of "computer system" happen to include microcontrollers? Some microcontrollers have very predictable execution times; for example, the 8-bit PIC series takes four clock cycles (one instruction cycle) per instruction unless the instruction branches to a different address, reads from flash, or is a special two-word instruction.

Interrupts will obviously disrupt this kind of timing, but it is possible to do a lot without an interrupt handler in a "bare metal" configuration.

Using assembly and a special coding style it is possible to write code that will always take the same time to execute. It isn't so common now that most PIC variants have multiple timers, but it is possible.
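
For a flavour of that coding style, here is a sketch in 6502 assembly rather than PIC mnemonics (purely because its cycle tables already appear in an answer above; the technique is the same): the shorter path is padded so that both outcomes of a test reach the join point after exactly the same number of cycles.

        BEQ Else        ; 2 cycles if not taken, 3 if taken (same page)
        LDA #1          ; 2 cycles            not-taken path: 2 + 2 + 3 = 7
        JMP Done        ; 3 cycles
Else:   LDA #0          ; 2 cycles            taken path:     3 + 2 + 2 = 7
        NOP             ; 2 cycles of padding
Done:                   ; either way, exactly 7 cycles have elapsed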

1

This is somewhat tangential, but the Space Shuttle had four redundant primary computers (plus a backup) that depended on being accurately synced, i.e. on their run time matching exactly.

The very first launch attempt of the Space Shuttle was scrubbed when the Backup Flight Software (BFS) computer refused to synchronize with the four Primary Avionics Software System (PASS) computers. Details are in "The Bug Heard Round the World" here. It is a fascinating read about how the software was developed to match cycle for cycle, and it may give you interesting background.

Edgar H
  • 111
  • 3
0

I think we're mixing two different issues here. (And yes, I know this has been said by others, but I hope I can express it more clearly.)

First, we need to get from the source code to the sequence of instructions that is actually executed (which needs knowledge of the input data as well as the code - how many times do you go round a loop? which branch is taken after a test?). The sequence of instructions may be infinite (non-termination), and because of the halting problem you can't always determine that statically, even with knowledge of the input data.

Having established the sequence of instructions to be executed, you then want to determine the execution time. That can certainly be estimated with some knowledge of the system architecture. But the problem is that on many modern machines, the execution time depends heavily on caching of memory fetches, which means it depends as much on the input data as on the instructions executed. It also depends on correct guessing of conditional branch destinations, which again is data dependent. So it's only going to be an estimate - it's not going to be exact.

Michael Kay
  • 449
  • 2
  • 4