I'm looking for a generic static analysis framework that could be used to detect problems that aren't necessarily specific to a particular programming language; for example:
- Variable taint checking / unvalidated user input
- Detecting when a variable is unused
- Detecting when an identifier shadows another identifier that is visible in the same scope
- Detecting unsafe casting / direct assignment of variables of one type to another type
One way I imagine this might work would be for language-specific parsers to emit generalized information about identifiers and how they're used. Here is a possible representation, followed by programs in three different languages that might all map to something similar:
Intermediate Representation
This is not supposed to be an exhaustive list of properties / annotations that might be supported, and not all properties or their values would be supported for every language.
file: {
    scope: "global"
    path: "[...]"
    declarations: [
        {
            type: "variable"
            identifier: "SCRIPT_NAME"
            annotations: [
                "constant",
            ]
        },
        {
            type: "variable"
            identifier: "ARGV"
            annotations: [
                "untrusted-input",
            ]
        },
        {
            type: "function"
            identifier: "never_called"
        },
        {
            type: "function"
            identifier: "multiply"
            calls: [
                {
                    filename: "[...]"
                    line: 19
                }
            ]
        },
        {
            type: "function"
            identifier: "main"
        }
    ]
}
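To make the idea more concrete, here is a minimal sketch of what a language-independent check over that representation might look like. It assumes the representation has been loaded into plain Python dictionaries; `FILE_IR`, `uncalled_functions`, `untrusted_variables`, and the `entry_points` parameter are names I made up for illustration, not part of any existing tool.

```python
# Hypothetical in-memory form of the representation above.
FILE_IR = {
    "scope": "global",
    "path": "[...]",
    "declarations": [
        {"type": "variable", "identifier": "SCRIPT_NAME",
         "annotations": ["constant"]},
        {"type": "variable", "identifier": "ARGV",
         "annotations": ["untrusted-input"]},
        {"type": "function", "identifier": "never_called"},
        {"type": "function", "identifier": "multiply",
         "calls": [{"filename": "[...]", "line": 19}]},
        {"type": "function", "identifier": "main"},
    ],
}


def uncalled_functions(file_ir, entry_points=("main",)):
    """Return identifiers of functions with no recorded call sites.

    Entry points are skipped (an assumption of this sketch), because
    "main" has no recorded calls in the representation either.
    """
    return [
        decl["identifier"]
        for decl in file_ir["declarations"]
        if decl["type"] == "function"
        and not decl.get("calls")
        and decl["identifier"] not in entry_points
    ]


def untrusted_variables(file_ir):
    """Return identifiers of variables annotated as untrusted input."""
    return [
        decl["identifier"]
        for decl in file_ir["declarations"]
        if decl["type"] == "variable"
        and "untrusted-input" in decl.get("annotations", [])
    ]


print(uncalled_functions(FILE_IR))   # ['never_called']
print(untrusted_variables(FILE_IR))  # ['ARGV']
```

Neither check knows which language produced the declarations, which is the property I'm after.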
Python
#!/usr/bin/env python3
import os
import sys
ARGV = sys.argv
SCRIPT_NAME = os.path.basename(ARGV[0])

def never_called():
    pass

def multiply(a, b):
    return a * b

def main():
    result = multiply(int(sys.argv[1]), int(sys.argv[2]))
    print(multiply(result, int(sys.argv[3])))

if __name__ == "__main__":
    main()
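For Python, the language-specific front end could probably be built on the standard `ast` module. The following is only a rough sketch under my own assumptions: `emit_declarations` and `SOURCE` are hypothetical names, only top-level functions and simple assignments are recorded, and only calls made by bare name are matched. Unlike the representation I wrote by hand, it also records the module-level call to `main`.

```python
import ast

# A shortened stand-in for the script above (hypothetical example input).
SOURCE = '''\
import sys
ARGV = sys.argv

def never_called():
    pass

def multiply(a, b):
    return a * b

def main():
    print(multiply(int(ARGV[1]), int(ARGV[2])))

main()
'''


def emit_declarations(source, filename="<example>"):
    """Emit a generalized description of top-level declarations."""
    tree = ast.parse(source, filename=filename)
    declarations = {}

    # Pass 1: collect top-level functions and simple variable assignments.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            declarations[node.name] = {
                "type": "function", "identifier": node.name, "calls": [],
            }
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    declarations[target.id] = {
                        "type": "variable", "identifier": target.id,
                    }

    # Pass 2: record call sites of known functions called by bare name.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            decl = declarations.get(node.func.id)
            if decl is not None and decl["type"] == "function":
                decl["calls"].append(
                    {"filename": filename, "line": node.lineno}
                )

    return {
        "scope": "global",
        "path": filename,
        "declarations": list(declarations.values()),
    }


for decl in emit_declarations(SOURCE)["declarations"]:
    print(decl)
```

A real front end would also need to handle nested scopes, methods, aliases, and annotations such as "untrusted-input", none of which this sketch attempts.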
C
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
char *SCRIPT_NAME;

void never_called()
{
}

int multiply(int a, int b)
{
    return a * b;
}

int main(int argc, char **argv)
{
    int result;

    SCRIPT_NAME = basename(argv[0]);
    result = multiply(atoi(argv[1]), atoi(argv[2]));
    printf("%d\n", multiply(result, atoi(argv[3])));
    return 0;
}
BASH
#!/usr/bin/env bash
declare SCRIPT_NAME="$0"
declare -a ARGV=("$0" "$@")

function never_called()
{
    true
}

function multiply()
{
    local -i a="$1"
    local -i b="$2"
    echo "$((a * b))"
}

function main()
{
    local -i result
    result="$(multiply "${ARGV[1]}" "${ARGV[2]}")"
    multiply "$result" "${ARGV[3]}"
}

main
Notes / More Ideas
- In the Python script, the analyzer could see that "int" is called on untrusted input, and it could record that the resulting values have effectively been sanitized by the "int" function. I imagine taint tracking would always maintain historical information to make it easy to assess why a variable is considered untainted.
- Similarly, the analyzer might have annotations for the type of validation that has been performed. For example, there could be annotations to indicate that a value has been confirmed to be an int and separate annotations that indicate that the range of the result has been validated.
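To illustrate the kind of bookkeeping I mean in the two notes above, here is a small sketch. It models the annotations and history at runtime rather than statically, purely to show the data an analyzer might keep per value. `TaintedValue`, `validate_int`, `validate_range`, and the annotation names "validated-int" and "validated-range" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaintedValue:
    """A value plus the annotations and history an analyzer might track."""
    value: object
    annotations: frozenset = frozenset({"untrusted-input"})
    history: tuple = ()


def validate_int(tv):
    """Model int() as a sanitizer: drop the taint, record why."""
    number = int(tv.value)  # raises ValueError if the input is not an int
    return TaintedValue(
        value=number,
        annotations=(tv.annotations - {"untrusted-input"}) | {"validated-int"},
        history=tv.history + ("int() conversion succeeded",),
    )


def validate_range(tv, low, high):
    """Add a separate annotation once the range has been checked."""
    if "validated-int" not in tv.annotations:
        raise TypeError("range check requires a validated int")
    if not low <= tv.value <= high:
        raise ValueError(f"{tv.value} is outside [{low}, {high}]")
    return TaintedValue(
        value=tv.value,
        annotations=tv.annotations | {"validated-range"},
        history=tv.history + (f"range check [{low}, {high}] passed",),
    )


raw = TaintedValue("42")
checked = validate_range(validate_int(raw), 0, 100)
print(sorted(checked.annotations))  # ['validated-int', 'validated-range']
print(checked.history)
```

The `history` tuple is what would let a reviewer answer "why is this considered untainted?" without re-running the analysis.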
So far I haven't been able to find anything like this. Perhaps something like this could be done with LLVM, as long as the language in question has an LLVM implementation. I don't think it would even necessarily need to be a completely working or perfectly correct implementation.