I'm new to the world of static analysis and am trying to build a new analysis of C programs for the LLVM compiler.

I've started by building the constraint graph of the program: the nodes represent run-time memory locations, and the edges represent the flow of data through the program (according to statements or function calls).

I'm wondering whether, to label the constraints, I need a symbol table holding all the constraints and their labels. I found out that we can build a CFG on top of LLVM just by traversing the LLVM IR.

For example, we could just have:

// Build list nodes without successors: one node per instruction in F.
for (Function::iterator BB = F.begin(), E = F.end(); BB != E; ++BB) {
    for (BasicBlock::iterator BI = BB->begin(), BE = BB->end(); BI != BE; ++BI) {
        // Every element of a basic block is an Instruction, so no dyn_cast is needed.
        Instruction *instruction = &*BI;
        StaticAnalysis::ListNode *node = new StaticAnalysis::ListNode(counter++);
        node->inst = instruction;
        // helper maps each instruction to its node so it can be looked up later.
        helper.insert(std::make_pair(instruction, node));
        CFGNodes.push_back(node);
    }
}

So, my question is: would this also be possible for a flow graph? Or is a symbol table needed to construct one?

Raphael

1 Answer

You want to build a graph whose vertices are the runtime locations in memory. Unfortunately, this isn't possible in general: the set of runtime locations can be unbounded, so no finite graph can have one vertex per location.

For instance, consider code that builds a linked list holding $n$ data items read from the input: it uses at least $n$ memory locations, and since $n$ depends on the input, there is no upper bound on the number of memory locations. Or, consider code like

while (...) {
    char *p = malloc(..);
    ... do something with p ...
    free(p);
}

Each call to malloc() creates a new memory location, so the number of runtime memory locations is at least the number of iterations of the loop. There may be no fixed upper bound on the possible number of iterations of the loop.
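To make the linked-list case concrete, here is a minimal C++ sketch. The names `Node` and `count_live_locations` are made up for illustration. It builds one list node per input item and counts how many distinct heap addresses are live at once, which is always the input length, so no bound can be fixed in advance.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical list node, used only for this illustration.
struct Node {
    int value;
    Node *next;
};

// Builds a singly linked list with one node per input item and returns
// how many distinct heap addresses were live at the same time.
std::size_t count_live_locations(const std::vector<int> &input) {
    Node *head = nullptr;
    std::set<const Node *> addresses;
    for (int v : input) {
        head = new Node{v, head};  // one fresh heap location per item
        addresses.insert(head);
    }
    std::size_t live = addresses.size();
    while (head != nullptr) {      // release the list again
        Node *next = head->next;
        delete head;
        head = next;
    }
    return live;
}
```

Calling `count_live_locations` on a vector of length $n$ returns $n$: the count tracks the input, not anything visible in the program text.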

Different static analysis techniques deal with this in different ways:

  • Data flow analysis typically deals with this by ignoring everything in the heap, and only trying to reason about local variables. In this way, you obtain a finite graph.

  • Points-to analysis deals with this using abstraction. It builds a points-to graph, where each node in the graph corresponds to a potentially unbounded number of runtime memory locations. At each node in the graph, we record facts that are known to be true for all of the runtime memory locations that correspond to that node. For instance, we might have one node per malloc() call site. In other words, we might have a single node for all runtime memory locations that are allocated by any of the calls to malloc() on line 79 (the first call to malloc() on that line, the second call to malloc() on that line, etc.; they're all merged into a single node).
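The allocation-site abstraction can be sketched in a few lines of C++. This is only a toy, flow-insensitive, inclusion-based (Andersen-style) solver over a made-up constraint language; `Constraint`, `Kind`, `solve` and the site labels are hypothetical names, not part of LLVM or any existing tool. Each `malloc()` call site gets one abstract node, however many times it executes at run time.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical constraint forms for a toy pointer language.
enum class Kind { Alloc, Copy, Load, Store };

struct Constraint {
    Kind kind;
    std::string lhs;  // Alloc/Copy/Load: variable assigned; Store: pointer written through
    std::string rhs;  // Alloc: site label; Copy: source; Load: pointer read through; Store: value stored
};

// Maps each variable or abstract heap node to the set of abstract nodes it may point to.
using PointsTo = std::map<std::string, std::set<std::string>>;

// Adds every element of `from` to `to`; returns true if `to` grew.
static bool add_all(std::set<std::string> &to, const std::set<std::string> &from) {
    std::size_t before = to.size();
    to.insert(from.begin(), from.end());
    return to.size() != before;
}

// Re-applies all constraints until no points-to set changes (a fixpoint).
// Sets only grow and the universe of labels is finite, so this terminates.
PointsTo solve(const std::vector<Constraint> &constraints) {
    PointsTo pts;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Constraint &c : constraints) {
            switch (c.kind) {
            case Kind::Alloc:  // lhs = malloc(...) at site rhs
                changed |= pts[c.lhs].insert(c.rhs).second;
                break;
            case Kind::Copy: {  // lhs = rhs
                const std::set<std::string> src = pts[c.rhs];
                changed |= add_all(pts[c.lhs], src);
                break;
            }
            case Kind::Load: {  // lhs = *rhs
                const std::set<std::string> targets = pts[c.rhs];
                for (const std::string &o : targets) {
                    const std::set<std::string> src = pts[o];
                    changed |= add_all(pts[c.lhs], src);
                }
                break;
            }
            case Kind::Store: {  // *lhs = rhs
                const std::set<std::string> targets = pts[c.lhs];
                const std::set<std::string> src = pts[c.rhs];
                for (const std::string &o : targets)
                    changed |= add_all(pts[o], src);
                break;
            }
            }
        }
    }
    return pts;
}
```

For a loop body `p = malloc(..); q = p; r = malloc(..); *p = r; s = *q;` with the two call sites labelled `site1` and `site2`, the solver reports that `q` may point to `site1` and `s` may point to `site2`. The graph stays finite even if the loop runs an unbounded number of times, because every object allocated at a site is merged into that site's single node.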

You might want to spend some time reading up on how data flow analysis and points-to analysis work.

More generally, I suggest you ask a new question where you tell us what you're trying to achieve. Presumably the graph was only a means to an end, something you wanted to build because you thought it would help with some task. What is it you really want to learn about the program? If you tell us what you're trying to infer, we might be able to suggest a static analysis technique suited to that goal.

D.W.