11

I have a branching in c++:

if (x & 1)
{
    x = function_1(x);
}
else
{
    x = function_2(x);
}

If function_1 and function_2 are constant time and it takes the same time to compute them, is such branching still vulnerable to side-channel attacks? Can an attacker somehow know which condition was executed?

j.p.
Tom

3 Answers

14

Yes, if/else is vulnerable to timing attack. So is selecting the function to call by an array index, as in that other answer.

The only next-to-reliable solution for constant-time execution in a high-level language, lacking indication about the target environment, is: Make sure the code execution path and the memory access patterns are independent of the data.

This precludes calling only one of function_1 or function_2 (as the question's code and said other answer do), and indexing arrays, based on data that changes under the control of, or in a manner observable by, an adversary, even partly and indirectly.

If the extra cost of calling both functions is tolerable, and the behavior of the functions is well-defined and acceptable regardless of the low-order bit of their input, then this will do with most current C compilers and hardware:

   m = -(x&1);  // mask variable; assumes  m  and  x  are int
   x = (function_1(x) & m) | (function_2(x) & ~m);

or equivalently

   x = function_1(x) & -(x&1) | function_2(x) & ~ -(x&1);

This calls both functions, and selects the result appropriately, using the fact that the mask -(x&1) has all bits set to 1 for odd x, and to 0 otherwise.

I do not know of any CPU + C compiler introduced since 1985 where that technique would introduce a timing dependency on x. Modern literature on constant-time implementations of crypto assumes there is none, without saying so. That's despite the fact that the execution time of modern high-performance CPUs is unspecified, and can change unpredictably from the perspective of an application programmer (e.g. due to compilers/linkers and their options, code and data alignment, interrupts, memory wait states and refresh cycles, caches, pipelines, speculative execution, branch predictors, other threads on the same core/CPU/linked CPUs, arbitration of resources between threads on the same core, virtualization, BIOS or VM settings, and the seemingly endless flow of flawed microcode changes compromising performance to mitigate the latest reported side-channel attack…)

Importantly, the unrealistic hypothesis in the question

function_1 and function_2 are constant time and it takes the same time to compute them

can be replaced by the plausible and falsifiable:

Variations in the execution time of function_1 are uncorrelated to the value of x. function_2 has the same property.

Caveat 1: the C/C++ languages give assurance only about the results obtained, none about execution time. I know of no compiler manual that cares to specify execution time, and I know of some CPUs whose available documentation does not allow determining it.

Caveat 2: -(x&1) to get a mask according to the low-order bit of x is a well-known idiom, and nowadays works on almost all CPUs. That's because they use the two's complement convention, with -1 represented by all ones. But I have lost track of whether C promises to hide whatever different behavior the hardware might have (I leave that to the useful comments). And that needs to be ascertained no matter what.

Caveat 3: Adaptation may be necessary according to the type of x, due to C promotion rules. I usually cast 1 to the type of x; e.g. if x is a uint32_t, I use -(x&(uint32_t)1), and would live happily with it were it not for compiler warnings.

Some compilers complain when taking the negative of an unsigned expression. They are wrong, per the C standard. The easy route is to bow and use (0-(x&1)), with casting of the constants as above. Another is to disable that complaint with some pragma or option. Yet another, hopeless in my experience, is to file a problem report. Carefully made arguments go to NUL (the local equivalent of /dev/null), including: the construct is useful, common, and explicitly supported by ISO C; and warning about it could get all warnings indiscriminately silenced.

Note: I assumed the return type of function_1 and function_2 is the same type as x.


Performing security/cryptography on standard CPUs where adversaries can also run code has been consistently brittle since the 1980s. There are useful advances: process isolation with Memory Management Units, security rings or enclaves. There are software heuristics that reduce vulnerability to acceptable levels. But, so far, no silver bullet.

So when the stakes are high, it's often best to offload security functions to trusted hardware (Smart Cards, security CPUs, HSMs, TPMs…) designed for security first rather than performance first, and where adversaries can't run code (the simplest, thus best, case), or where that ability is designed in.

fgrieu
4

This started as an edit to fgrieu's answer, but it grew very long, so I decided to post it as a separate answer.

I agree with everything fgrieu wrote in his answer.

I will quote fgrieu's answer here to give it more emphasis:

The only next-to-reliable solution for constant-time execution in a high-level language, lacking indication about the target environment, is: Make sure the code execution path and the memory access patterns are independent of the data.

Leaving memory access patterns aside, since they are not in the scope of the question, the quoted goal can only be achieved, in theory, with constant memory-access time and constant-time execution of every instruction. Of course these come at the cost of extra clock cycles and extra memory accesses.

Memory access patterns refer to another type of side-channel attack, one that requires Oblivious Random Access Memory (ORAM) to defend against.


If function_1 and function_2 are constant time and it takes the same time to compute them, is such branching still vulnerable for side-channel attacks?

Most commonly yes: if/else is vulnerable to timing attacks, because with an if/else statement you can expect the compiler or the runtime environment to emit a branch. In any case you generally try to avoid it.

Selecting the function to call by an array index, as in mentallurg's answer (in Python, for example), isn't that bad an approach: it has constant memory-access time, and the code Python executes to make the function call is constant time.


Before I begin:

  • In the case of memory accesses, note that since we are interested in timing attacks, we only consider memory access time, not the access patterns that may leak information, as those would make our situation a bit more complicated.
  • I only consider caches at the end; for now we ignore them, although, as everyone knows, CPU caches can be a cryptographer's worst nightmare in such situations.
  • I assume that every instruction that doesn't involve a memory access takes the same number of clock cycles to execute.

Analysis of mentallurg's answer :

Let's take this simple Python code for example

def f1(x):
    return x + 1

def f2(x):
    return x + 2

def main():
    functions = [f1, f2]
    x = int(input("Enter value:"))
    i = x % 2
    result = functions[i](x)
    print(result)

if __name__ == '__main__':
    main()

This gets compiled to the following Python bytecode (output of pycdas for .pyc compiled with Python 3.10) :

        [Code]
            File Name: test.py
            Object Name: main
            Arg Count: 0
            Pos Only Arg Count: 0
            KW Only Arg Count: 0
            Locals: 4
            Stack Size: 3
            Flags: 0x00000043 (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE)
            [Names]
                'f1'
                'f2'
                'int'
                'input'
                'print'
            [Var Names]
                'functions'
                'x'
                'i'
                'result'
            [Free Vars]
            [Cell Vars]
            [Constants]
                None
                'Enter value:'
                2
            [Disassembly]
                0       LOAD_GLOBAL             0: f1
                2       LOAD_GLOBAL             1: f2
                4       BUILD_LIST              2
                6       STORE_FAST              0: functions
                8       LOAD_GLOBAL             2: int
                10      LOAD_GLOBAL             3: input
                12      LOAD_CONST              1: 'Enter value:'
                14      CALL_FUNCTION           1
                16      CALL_FUNCTION           1
                18      STORE_FAST              1: x
                20      LOAD_FAST               1: x
                22      LOAD_CONST              2: 2
                24      BINARY_MODULO           
                26      STORE_FAST              2: i
                28      LOAD_FAST               0: functions
                30      LOAD_FAST               2: i
                32      BINARY_SUBSCR           
                34      LOAD_FAST               1: x
                36      CALL_FUNCTION           1
                38      STORE_FAST              3: result
                40      LOAD_GLOBAL             4: print
                42      LOAD_FAST               3: result
                44      CALL_FUNCTION           1
                46      POP_TOP                 
                48      LOAD_CONST              0: None
                50      RETURN_VALUE

The lookup and the execution of the line result = functions[i](x) is done by the bytecodes at offsets 26–36. Let's take a look at the BINARY_SUBSCR opcode. It takes two operands from the stack (the ones that were previously loaded with LOAD_FAST): a container (a list here; it can also be a dict or a tuple, but let's not focus on that) and an index. It pushes the value at that index and decreases the stack depth by 1.

Now, let's take a look at how BINARY_SUBSCR is implemented in CPython. The implementation can be found here and is the following:

        TARGET(BINARY_SUBSCR) {
            PREDICTED(BINARY_SUBSCR);
            PyObject *sub = POP();
            PyObject *container = TOP();
            PyObject *res = PyObject_GetItem(container, sub);
            Py_DECREF(container);
            Py_DECREF(sub);
            SET_TOP(res);
            if (res == NULL)
                goto error;
            JUMPBY(INLINE_CACHE_ENTRIES_BINARY_SUBSCR);
            DISPATCH();
        }

Now the whole analysis can be focused on PyObject *res = PyObject_GetItem(container, sub);. This is a generic method, and various other methods are called in between before the item is retrieved. Of course we can expect $O(1)$ complexity; it remains to check it. At the end PyList_GetItem is called, which is the following:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (!valid_index(i, Py_SIZE(op))) {
        _Py_DECLARE_STR(list_err, "list index out of range");
        PyErr_SetObject(PyExc_IndexError, &_Py_STR(list_err));
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}

As we can see on the last line, it has $O(1)$ complexity. Of course, due to their complexity, high-level languages are never used in practice for such applications. So let's try this code in a lower-level language, like C, to see what it produces.

#include <stdio.h>

int f1(int x) { return x + 1; }

int f2(int x) { return x + 2; }

int main() {
    int x;
    scanf("%d", &x);
    int (*functions[2])(int) = {f1, f2};
    int i = x % 2;
    int result = functions[i](x);
}

This is the MIPS assembly from Godbolt with the latest GCC, without optimizations :

f1:
        addiu   $sp,$sp,-8
        sw      $fp,4($sp)
        move    $fp,$sp
        sw      $4,8($fp)
        lw      $2,8($fp)
        nop
        addiu   $2,$2,1
        move    $sp,$fp
        lw      $fp,4($sp)
        addiu   $sp,$sp,8
        jr      $31
        nop

f2:
        addiu   $sp,$sp,-8
        sw      $fp,4($sp)
        move    $fp,$sp
        sw      $4,8($fp)
        lw      $2,8($fp)
        nop
        addiu   $2,$2,2
        move    $sp,$fp
        lw      $fp,4($sp)
        addiu   $sp,$sp,8
        jr      $31
        nop

$LC0:
        .ascii  "%d\000"
main:
        addiu   $sp,$sp,-56
        sw      $31,52($sp)
        sw      $fp,48($sp)
        move    $fp,$sp
        addiu   $2,$fp,32
        move    $5,$2
        lui     $2,%hi($LC0)
        addiu   $4,$2,%lo($LC0)
        jal     __isoc99_scanf
        nop

        lui     $2,%hi(f1)
        addiu   $2,$2,%lo(f1)
        sw      $2,36($fp)
        lui     $2,%hi(f2)
        addiu   $2,$2,%lo(f2)
        sw      $2,40($fp)
        lw      $3,32($fp)
        li      $2,-2147483648          # 0xffffffff80000000
        ori     $2,$2,0x1
        and     $2,$3,$2
        bgez    $2,$L6
        nop

        addiu   $2,$2,-1
        li      $3,-2                   # 0xfffffffffffffffe
        or      $2,$2,$3
        addiu   $2,$2,1

$L6:
        sw      $2,24($fp)
        lw      $2,24($fp)
        nop
        sll     $2,$2,2
        addiu   $3,$fp,24
        addu    $2,$3,$2
        lw      $2,12($2)
        lw      $3,32($fp)
        nop
        move    $4,$3
        move    $25,$2
        jalr    $25
        nop

        sw      $2,28($fp)
        move    $2,$0
        move    $sp,$fp
        lw      $31,52($sp)
        lw      $fp,48($sp)
        addiu   $sp,$sp,56
        jr      $31
        nop

More specifically, we are interested in these instructions, in which the call to the appropriate function is made :

        lw      $2,24($fp)
        nop
        sll     $2,$2,2
        addiu   $3,$fp,24
        addu    $2,$3,$2
        lw      $2,12($2)
        lw      $3,32($fp)
        nop
        move    $4,$3
        move    $25,$2
        jalr    $25
        nop
        sw      $2,28($fp)
        move    $2,$0
As we can see, no branches are made in this excerpt, except for jalr, which calls the appropriate function.


Analysis of fgrieu's answer

Of course it is easy to see that this is constant time :

#include <stdio.h>

int f1(int x) { return x + 1; }

int f2(int x) { return x + 2; }

int main() {
    int x;
    scanf("%d", &x);
    int (*functions[2])(int) = {f1, f2};
    int m = -(x&1);   // mask variable; m must have the same type as x
    x = (function_1(x) & m) | (function_2(x) & ~m);
    printf(x);
}

Again, Godbolt with the same options :

f1:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        add     eax, 1
        pop     rbp
        ret
f2:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        add     eax, 2
        pop     rbp
        ret
.LC0:
        .string "%d"
main:
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 40
        lea     rax, [rbp-24]
        mov     rsi, rax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    __isoc99_scanf
        mov     QWORD PTR [rbp-48], OFFSET FLAT:f1
        mov     QWORD PTR [rbp-40], OFFSET FLAT:f2
        mov     eax, DWORD PTR [rbp-24]
        and     eax, 1
        neg     eax
        mov     DWORD PTR [rbp-20], eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_1
        and     eax, DWORD PTR [rbp-20]
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_2
        mov     edx, DWORD PTR [rbp-20]
        not     edx
        and     eax, edx
        or      eax, ebx
        mov     DWORD PTR [rbp-24], eax
        mov     eax, DWORD PTR [rbp-24]
        cdqe
        mov     rdi, rax
        mov     eax, 0
        call    printf
        mov     eax, 0
        mov     rbx, QWORD PTR [rbp-8]
        leave
        ret

We focus on this part, where the assignment to m and the inline branchless selection are done :

        mov     eax, DWORD PTR [rbp-24]
        and     eax, 1
        neg     eax
        mov     DWORD PTR [rbp-20], eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_1
        and     eax, DWORD PTR [rbp-20]
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_2
        mov     edx, DWORD PTR [rbp-20]
        not     edx
        and     eax, edx
        or      eax, ebx
        mov     DWORD PTR [rbp-24], eax

We can actually see constant time execution.

So what is the difference between these two solutions?

Both satisfy anti-timing-analysis measures, but the second one (fgrieu's masked selection) also hides the access patterns: it always calls both functions. Although we haven't considered caches so far, in their presence the second one seems more secure against timing side-channel attacks, simply because executing it requires the code of both functions to be in the cache. With the first one (array indexing), only the function that is actually called is cached: if for some period of time only f1 was called and f2 was evicted from the cache, there would be a difference in time when f2 is called again.

For other solutions and further reading on the matter you can consider [1] and [2].

tur11ng
-3

You can put the functions into an array and refer to them by index.

In Python it would look like the following:

def f1( x ):
    ...

def f2( x ):
    ...

functions = [ f1, f2 ]

i = x % 2

result = functions[ i ]( x )

mentallurg