Is there a language-independent, static analysis framework?

Question

I'm looking for a generic static analysis framework that could be used to detect problems that aren't necessarily specific to a particular programming language; for example:

Variable taint checking / unvalidated user input
Detecting when a variable is unused
Detecting when an identifier shadows another identifier in the same scope
Detecting unsafe casting / direct assignment of variables of one type to another type

One way I imagine this might work would be for language-specific parsers to emit generalized information about identifiers and how they're used. Here are programs in three different languages that might have similar representations:

Intermediate Representation

This is not supposed to be an exhaustive list of properties / annotations that might be supported, and not all properties or their values would be supported for every language.

file: {
    scope: "global"
    path: "[...]"

    declarations: [
        {
            type: "variable"
            identifier: "SCRIPT_NAME"
            annotations: [
                "constant",
            ]
        },

        {
            type: "variable"
            identifier: "ARGV"
            annotations: [
                "untrusted-input",
            ]
        },

        {
            type: "function"
            identifier: "never_called"
        },

        {
            type: "function"
            identifier: "multiply"
            calls: [
                {
                    filename: "[...]"
                    line: 19
                }
            ]
        },

        {
            type: "function"
            identifier: "main"
        }
    ]
}
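To make the idea more concrete, here is a rough sketch of how language-agnostic checks might run over that representation. It assumes the IR has been loaded as plain Python dictionaries; the function names `uncalled_functions` and `untrusted_variables`, and the idea of passing a list of entry points, are my own placeholders rather than part of any existing tool.

```python
# Hypothetical IR, mirroring the sketch above as plain Python data.
ir = {
    "scope": "global",
    "path": "[...]",
    "declarations": [
        {"type": "variable", "identifier": "SCRIPT_NAME",
         "annotations": ["constant"]},
        {"type": "variable", "identifier": "ARGV",
         "annotations": ["untrusted-input"]},
        {"type": "function", "identifier": "never_called"},
        {"type": "function", "identifier": "multiply",
         "calls": [{"filename": "[...]", "line": 19}]},
        {"type": "function", "identifier": "main"},
    ],
}


def uncalled_functions(file_ir, entry_points=("main",)):
    """Return functions with no recorded call sites, ignoring entry points."""
    return [
        decl["identifier"]
        for decl in file_ir["declarations"]
        if decl["type"] == "function"
        and not decl.get("calls")
        and decl["identifier"] not in entry_points
    ]


def untrusted_variables(file_ir):
    """Return variables annotated as holding untrusted input."""
    return [
        decl["identifier"]
        for decl in file_ir["declarations"]
        if decl["type"] == "variable"
        and "untrusted-input" in decl.get("annotations", [])
    ]


print(uncalled_functions(ir))   # ['never_called']
print(untrusted_variables(ir))  # ['ARGV']
```

Neither check knows which language produced the declarations, which is the property I'm after.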

Python

#!/usr/bin/env python3
import os
import sys

ARGV = sys.argv
SCRIPT_NAME = os.path.basename(ARGV[0])

def never_called():
    pass

def multiply(a, b):
    return a * b

def main():
    result = multiply(int(sys.argv[1]), int(sys.argv[2]))
    print(multiply(result, int(sys.argv[3])))

if __name__ == "__main__":
    main()
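For Python, I imagine the language-specific front end could be fairly small. The following is only a sketch using the standard `ast` module: `emit_ir` and `is_sys_argv` are hypothetical names, it only looks at top-level declarations and direct calls by name, and treating a direct assignment from `sys.argv` as "untrusted-input" is a hard-coded assumption for illustration.

```python
import ast

# The Python program above (minus the shebang), embedded for illustration.
SOURCE = '''
import os
import sys

ARGV = sys.argv
SCRIPT_NAME = os.path.basename(ARGV[0])

def never_called():
    pass

def multiply(a, b):
    return a * b

def main():
    result = multiply(int(sys.argv[1]), int(sys.argv[2]))
    print(multiply(result, int(sys.argv[3])))

if __name__ == "__main__":
    main()
'''


def is_sys_argv(expr):
    """True if the expression is exactly `sys.argv` (assumed taint source)."""
    return (
        isinstance(expr, ast.Attribute)
        and expr.attr == "argv"
        and isinstance(expr.value, ast.Name)
        and expr.value.id == "sys"
    )


def emit_ir(source, path="[...]"):
    """Emit a generalized description of top-level declarations and calls."""
    tree = ast.parse(source)
    declarations = {}

    # Pass 1: collect top-level functions and simple variable assignments.
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            declarations[node.name] = {
                "type": "function", "identifier": node.name, "calls": [],
            }
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    annotations = (
                        ["untrusted-input"] if is_sys_argv(node.value) else []
                    )
                    declarations[target.id] = {
                        "type": "variable", "identifier": target.id,
                        "annotations": annotations,
                    }

    # Pass 2: record call sites for calls made directly by name.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            decl = declarations.get(node.func.id)
            if decl is not None and decl["type"] == "function":
                decl["calls"].append({"filename": path, "line": node.lineno})

    return {
        "scope": "global",
        "path": path,
        "declarations": list(declarations.values()),
    }


file_ir = emit_ir(SOURCE)
for decl in file_ir["declarations"]:
    print(decl)
```

A real front end would also need to handle nested scopes, methods, aliases, and indirect calls, none of which this sketch attempts.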

C

#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>

char *SCRIPT_NAME;

void never_called(void)
{
}

int multiply(int a, int b)
{
    return a * b;
}

int main(int argc, char **argv)
{
    int result;

    SCRIPT_NAME = basename(argv[0]);

    result = multiply(atoi(argv[1]), atoi(argv[2]));
    printf("%d\n", multiply(result, atoi(argv[3])));

    return 0;
}

BASH

#!/usr/bin/env bash
declare SCRIPT_NAME="$0"
declare -a ARGV=("$0" "$@")

function never_called()
{
    true
}

function multiply()
{
    local -i a="$1"
    local -i b="$2"

    echo "$((a * b))"
}

function main()
{
    local -i result

    result="$(multiply "${ARGV[1]}" "${ARGV[2]}")"
    multiply "$result" "${ARGV[3]}"
}

main

Notes / More Ideas

In the Python script, the analyzer could see that "int" is called on untrusted input, and it could record that the variables have effectively been sanitized by the "int" function. I imagine taint tracking would always maintain historical information to make it easy to assess why a variable is considered untainted.
Similarly, the analyzer might have annotations for the type of validation that has been performed. For example, there could be annotations to indicate that a value has been confirmed to be an int and separate annotations that indicate that the range of the result has been validated.
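As a rough illustration of the two notes above, the taint state for a value could carry both a set of validation annotations and a history of how it got there. Everything here is hypothetical: the `TaintState` class, the `SANITIZERS` table, and annotation names such as "is-int" and "range-checked" are placeholders I made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class TaintState:
    identifier: str
    tainted: bool
    validations: set = field(default_factory=set)
    history: list = field(default_factory=list)


# Hypothetical table mapping sanitizing calls to the validation they imply.
SANITIZERS = {"int": "is-int"}


def from_untrusted(identifier, source):
    """Create the state for a value read from an untrusted source."""
    state = TaintState(identifier, tainted=True)
    state.history.append(f"tainted: read from {source}")
    return state


def apply_call(state, function, line):
    """Update the state after the value is passed through a function call."""
    validation = SANITIZERS.get(function)
    if validation is None:
        state.history.append(f"line {line}: {function}() does not sanitize")
        return state
    state.tainted = False
    state.validations.add(validation)
    state.history.append(f"line {line}: sanitized by {function}() ({validation})")
    return state


def record_range_check(state, low, high, line):
    """Record that the value was confirmed to lie within [low, high]."""
    state.validations.add("range-checked")
    state.history.append(f"line {line}: range checked against [{low}, {high}]")
    return state


arg = from_untrusted("sys.argv[1]", "sys.argv")
apply_call(arg, "int", line=15)

print(arg.tainted)              # False
print(sorted(arg.validations))  # ['is-int']
for entry in arg.history:
    print(entry)
```

Here the value is known to be an int but has not been range-checked, so a rule that requires "range-checked" could still flag it, and the history explains why it is no longer considered tainted.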

So far I haven't been able to find anything like this. Perhaps something like this could be done with LLVM, as long as the language in question had an LLVM implementation. I don't think that implementation would even need to be complete or perfectly correct.
