How to identify similar expressions for automatic function extraction?

How can I define structurally common AST subtrees in order to decompose them into separate functions?

eg. with this pseudocode in mind (assume the language only allows pure, terminating functions):

f(a, b, c) {
    return (a + b) * c * 6;
}

g(x[4], k) {
    var y[4];
    for (i in 0..3)
        y[i] = f(x[i], 1, k);
    return y;
}

varying arr[4];
result = g(arr, 1);

      

... after full specialization and insertion, we get the following tree of primitive operations representing the value of the program result:

(make-vec4
    (* (arr 0) 6)
    (* (+ (arr 1) 1) 6)
    (* (+ (arr 2) 1) 6)
    (* (+ (arr 3) 1) 6) )

      

(this is a terrible example, as the extended result is still very similar in structure to the input ... assuming changes can propagate to the structural boundaries of the input code though)

It is obvious to the human eye that the resulting tree contains three similar expressions, which we can now convert to calls to a function of the type fn(i) { return 6 * (arr[i] + 1); }

, since the size of the cache is mumble mumble, etc. (or it is more realistic to use, for example, a map

or fold

). But how does the compiler identify them as similar in order to be considered candidates for extraction?

Eliminating identical subexpressions should be easy using something like hashing. But you can't use that for this problem, because the hashes created by moving upward from the leaves are not related in any way to each other. Is there a way to "build" from the root nodes and determine the divergence point between the two expression trees to see where the branch becomes an argument? (without using any knowledge of the shape of the original program, which was supposedly expanded without any recognition and might not have been optimally split anyway)

It looks like there must be some way to do this, ordering subtrees and comparing neighbors, but this would require some sort of position independent ordering ...?

+3


source to share


1 answer


What you want to do is called "clone detection". What you specifically want to do is detect clones above abstract syntax trees.

This technical article (to me) is (the last time I looked) the most referenced document on how to do this: Clone detection using abstract syntax trees .



There is a commercial tool based on this approach called CloneDR.

+3


source







All Articles