I'm looking for a generic static analysis framework that could be used to detect problems that aren't necessarily specific to a particular programming language; for example:

  • Variable taint checking / unvalidated user input
  • Detecting when a variable is unused
  • Detecting when an identifier shadows another identifier from an enclosing scope
  • Detecting unsafe casting / direct assignment of variables of one type to another type

One way I imagine this might work would be for language-specific parsers to emit generalized information about identifiers and how they're used. Here are programs in three different languages that might have similar representations:

Intermediate Representation

This is not supposed to be an exhaustive list of properties / annotations that might be supported, and not all properties or their values would be supported for every language.

file: {
    scope: "global"
    path: "[...]"

    declarations: [
        {
            type: "variable"
            identifier: "SCRIPT_NAME"
            annotations: [
                "constant",
            ]
        },

        {
            type: "variable"
            identifier: "ARGV"
            annotations: [
                "untrusted-input",
            ]
        },

        {
            type: "function"
            identifier: "never_called"
        },

        {
            type: "function"
            identifier: "multiply"
            calls: [
                {
                    filename: "[...]"
                    line: 19
                }
            ]
        },

        {
            type: "function"
            identifier: "main"
        }
    ]
}
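
As a rough sketch of what a language-agnostic check over such a representation could look like, the snippet below mirrors the structure above as plain Python data and reports functions with no recorded call sites. The `uncalled_functions` helper and its `entry_points` parameter are hypothetical names introduced only for illustration; treating "main" as an implicit entry point is an assumption, since the representation above records no call site for it.

```python
# Hypothetical in-memory form of the representation above.
IR = {
    "scope": "global",
    "path": "[...]",
    "declarations": [
        {"type": "variable", "identifier": "SCRIPT_NAME",
         "annotations": ["constant"]},
        {"type": "variable", "identifier": "ARGV",
         "annotations": ["untrusted-input"]},
        {"type": "function", "identifier": "never_called"},
        {"type": "function", "identifier": "multiply",
         "calls": [{"filename": "[...]", "line": 19}]},
        {"type": "function", "identifier": "main"},
    ],
}


def uncalled_functions(ir, entry_points=("main",)):
    """Return identifiers of functions that have no recorded call sites.

    Assumption: names listed in entry_points are invoked by the runtime
    and should not be reported.
    """
    return [
        decl["identifier"]
        for decl in ir["declarations"]
        if decl["type"] == "function"
        and not decl.get("calls")
        and decl["identifier"] not in entry_points
    ]


print(uncalled_functions(IR))  # prints ['never_called']
```

The check itself never looks at source code, only at the emitted records, which is the property that would let it be shared across languages.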

Python

#!/usr/bin/env python3
import os
import sys

ARGV = sys.argv
SCRIPT_NAME = os.path.basename(ARGV[0])

def never_called():
    pass

def multiply(a, b):
    return a * b

def main():
    result = multiply(int(sys.argv[1]), int(sys.argv[2]))
    print(multiply(result, int(sys.argv[3])))

if __name__ == "__main__":
    main()
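
For the Python case, a language-specific front end could plausibly be built on the standard `ast` module. The following is a minimal sketch, not a complete implementation: `emit_declarations` and the `SAMPLE` program are hypothetical, only top-level definitions and calls by bare name are handled, and annotations such as "untrusted-input" are not inferred.

```python
import ast


def emit_declarations(source, filename="<source>"):
    """Emit IR-like records for top-level names in Python source."""
    tree = ast.parse(source, filename=filename)
    declarations = {}

    # Top-level function definitions and simple assignments
    # become declarations.
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            declarations[node.name] = {
                "type": "function", "identifier": node.name, "calls": []}
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    declarations[target.id] = {
                        "type": "variable", "identifier": target.id}

    # Record each call by bare name to a known function,
    # wherever it appears in the file.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            decl = declarations.get(node.func.id)
            if decl is not None and decl["type"] == "function":
                decl["calls"].append(
                    {"filename": filename, "line": node.lineno})

    return {"scope": "global", "path": filename,
            "declarations": list(declarations.values())}


# Hypothetical, shortened variant of the script above.
SAMPLE = '''\
import sys

ARGV = sys.argv

def never_called():
    pass

def multiply(a, b):
    return a * b

def main():
    print(multiply(int(ARGV[1]), int(ARGV[2])))

main()
'''

ir = emit_declarations(SAMPLE, "sample.py")
for decl in ir["declarations"]:
    print(decl)
```

Calls through attributes, aliases, or dynamic dispatch are ignored here, so a real front end would need name resolution on top of this.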

C

#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>

char *SCRIPT_NAME;

void never_called()
{
}

int multiply(int a, int b)
{
    return a * b;
}

int main(int argc, char **argv)
{
    int result;

    SCRIPT_NAME = basename(argv[0]);

    result = multiply(atoi(argv[1]), atoi(argv[2]));
    printf("%d\n", multiply(result, atoi(argv[3])));

    return 0;
}

BASH

#!/usr/bin/env bash
declare SCRIPT_NAME="$0"
declare -a ARGV=("$0" "$@")

function never_called()
{
    true
}

function multiply()
{
    local -i a="$1"
    local -i b="$2"

    echo "$((a * b))"
}

function main()
{
    local -i result

    result="$(multiply "${ARGV[1]}" "${ARGV[2]}")"
    multiply "$result" "${ARGV[3]}"
}

main

Notes / More Ideas

  • In the Python script, the analyzer could see that "int" is called on untrusted input and record that the resulting values have effectively been sanitized by the "int" function. I imagine taint tracking would always maintain historical information to make it easy to assess why a variable is considered untainted.
  • Similarly, the analyzer might have annotations for the type of validation that has been performed. For example, there could be annotations to indicate that a value has been confirmed to be an int and separate annotations that indicate that the range of the result has been validated.
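
The two ideas above, validation annotations plus a history of how a value acquired them, could be sketched roughly as follows. Everything here is hypothetical: the `TrackedValue` class, the `SANITIZERS` rule table, the "validated-int" annotation name, and the line number are illustrative assumptions rather than part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedValue:
    """Analysis state for one value: annotations plus how it got them."""
    identifier: str
    annotations: set = field(default_factory=set)
    history: list = field(default_factory=list)


# Hypothetical rule table: function name -> annotation its result gains.
SANITIZERS = {"int": "validated-int"}


def apply_call(function, argument, result_identifier, line):
    """Model `result = function(argument)`, carrying annotations forward."""
    result = TrackedValue(
        result_identifier,
        set(argument.annotations),   # copy, so the argument is not mutated
        list(argument.history),
    )
    result.history.append(f"line {line}: passed through {function}()")
    validation = SANITIZERS.get(function)
    if validation is not None:
        result.annotations.add(validation)
        result.history.append(f"line {line}: gained '{validation}'")
    return result


def is_tainted(value):
    """Untrusted input counts as tainted until some validation is recorded."""
    untrusted = "untrusted-input" in value.annotations
    validated = any(a.startswith("validated-") for a in value.annotations)
    return untrusted and not validated


raw = TrackedValue("sys.argv[1]", {"untrusted-input"},
                   ["source: command-line argument"])
a = apply_call("int", raw, "a", line=15)  # line number is illustrative

print(is_tainted(raw), is_tainted(a))  # prints True False
for entry in a.history:
    print(entry)
```

Keeping "untrusted-input" on the result instead of removing it preserves the provenance, and a separate annotation (for example, a range check) could be added the same way without changing `is_tainted`.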

So far I haven't been able to find anything like this. Perhaps something like this could be done with LLVM, as long as the language in question had an LLVM implementation. I don't think it would even necessarily need to be a completely working or perfectly correct implementation.

Eric Pruitt