How to predict if a function name matches

Question


Let's say you have a repository of 10,000 function names, possibly with their frequency of use, taken from a body of code that might be in C, C#, or C++ (which usually have different conventions).

Some examples might be:

DoPaint
OnPaint
CloseWindow
DeleteGraphOnClose
FreeConnection
ConnectInternat (a small typo, but part of the code)
FreeSoH

Now, given a function name, how can we predict whether it follows a human naming convention?

Note:

All candidate names will obviously be valid identifiers.
Generated names can contain arbitrary characters and will be considered bad.
Letter case can be garbled.

Some candidates:

Z090292 - not likely
onDelete - likely
CloseWindow - likely
iGetIndex - unlikely

Any pointers to techniques and software are welcome.

data-mining text-mining

RD Aug 29 '09 at 21:36


6 answers

AAA · Answer 1 · 2009-08-29T21:57:59+0000

You can try to do some kind of Bayesian analysis on the text:

Load the list of names (and their frequencies) into your program. At this point, this may mean tokenizing the names. So, for example, CloseWindow becomes Close and Window, with the frequency of both being incremented. It would also be helpful to load some non-human function names, to train the program on the opposite case as well.
Take the name of the function and, using the data you just collected, find the probability of each part appearing:

P(HumanGenerated | seeing token) = P(seeing token | HumanGenerated) * P(HumanGenerated) / P(seeing token)

In this case, the prior probability P(HumanGenerated) is determined from existing knowledge, that is, what percentage of function names are considered human.

The probability of seeing the token, P(seeing token), should be refined gradually. It combines how many times the token appears in human function names and how many times it appears in computer-generated ones. This approach is based on the premise that the program learns over time (and therefore needs to be trained).

The result, P(HumanGenerated | seeing token), gives you the likelihood that the function name is human generated.

NB: This is just a rough outline; many details are missing. If you are interested in this line of research, I would suggest reading up on probability theory and, in particular, Bayesian analysis.
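The outline above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the names `tokenize`, `train`, and `p_human` are made up for this example, the list of generated names is invented training data, and two extra assumptions are added that the answer does not mention (add-one smoothing for unseen tokens, and mapping every digit run to a single `<num>` token). The per-token probabilities are combined naive-Bayes style, which assumes the tokens are independent.

```python
import math
import re
from collections import Counter

def tokenize(name):
    """Split an identifier on capitalization, e.g. 'CloseWindow' -> ['close', 'window'].
    Assumption: every run of digits is mapped to the single token '<num>'."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return ["<num>" if p.isdigit() else p.lower() for p in parts]

def train(human_names, generated_names):
    """Count token frequencies per class and estimate the prior P(HumanGenerated)."""
    human = Counter(t for n in human_names for t in tokenize(n))
    generated = Counter(t for n in generated_names for t in tokenize(n))
    prior = len(human_names) / (len(human_names) + len(generated_names))
    return human, generated, prior

def p_human(name, model):
    """Return P(HumanGenerated | tokens of name), using add-one smoothing."""
    human, generated, prior = model
    vocab = len(set(human) | set(generated)) + 1  # +1 slot for unseen tokens
    h_total, g_total = sum(human.values()), sum(generated.values())
    log_h, log_g = math.log(prior), math.log(1 - prior)
    for tok in tokenize(name):
        log_h += math.log((human[tok] + 1) / (h_total + vocab))
        log_g += math.log((generated[tok] + 1) / (g_total + vocab))
    # Normalize in log space so long names do not underflow.
    top = max(log_h, log_g)
    e_h, e_g = math.exp(log_h - top), math.exp(log_g - top)
    return e_h / (e_h + e_g)

# Human names come from the question; the generated names are hypothetical.
HUMAN = ["DoPaint", "OnPaint", "CloseWindow", "DeleteGraphOnClose",
         "FreeConnection", "ConnectInternat", "FreeSoH"]
GENERATED = ["Z090292", "X12Q77", "fn0031", "T5501A"]

model = train(HUMAN, GENERATED)
for candidate in ["CloseWindow", "onDelete", "Z090292"]:
    print(candidate, round(p_human(candidate, model), 3))
```

With such a tiny training set the scores are only indicative; with the 10,000 names and their frequencies from the question, the counts would be weighted by frequency and the prior chosen from what is known about the codebase.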

Martin v. Löwis · Answer 2 · 2009-08-29T21:43:52+0000

Split the identifiers into individual words (based on capitalization) and run the words through a spell checker (like ispell). Treat all misspelled words as non-human, as well as the identifiers in which they occur.
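As a rough sketch of this idea, the following splits identifiers on capitalization and checks each word against a word list. It does not call ispell; `KNOWN_WORDS` is a small hypothetical stand-in for a real dictionary, and the function names `split_words` and `spelled_correctly` are made up for this example.

```python
import re

# Hypothetical stand-in for a real spell-checker dictionary such as ispell's.
KNOWN_WORDS = {"do", "paint", "on", "close", "window", "delete",
               "graph", "free", "connection", "get", "index"}

def split_words(identifier):
    """Split on capitalization: 'DeleteGraphOnClose' -> ['Delete', 'Graph', 'On', 'Close']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)

def spelled_correctly(identifier, dictionary=KNOWN_WORDS):
    """True only if the identifier splits into words that are all in the dictionary."""
    words = split_words(identifier)
    return bool(words) and all(w.lower() in dictionary for w in words)

for name in ["CloseWindow", "onDelete", "Z090292", "ConnectInternat"]:
    print(name, spelled_correctly(name))
```

Note that this strict rule rejects ConnectInternat, which the question lists as a real (if misspelled) name, so a tolerance for near-misses may be worth considering.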

Peder skou · Answer 3 · 2009-08-29T21:48:43+0000

A friend of mine can help. As far as I can tell, he is doing his Ph.D. thesis on this topic.

Homepage


TrueWill · Answer 4 · 2009-08-29T21:45:18+0000

Predicting whether a name is human-generated is a very difficult problem. Analyzing your codebase to find function names is easier; you can look at tools like NDepend.

Jeff · Answer 5 · 2009-08-29T21:46:21+0000

You can probably find gum. Alternatively, you can do regular expression searches for typical words like do, get, set, in, etc., before the next capitalized word.
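A minimal sketch of such a regular expression search, assuming the typical word must be followed immediately by a capital letter. The prefix list is taken from the answer, with `on` added from the question's examples; the name `has_typical_prefix` is made up for this example.

```python
import re

# Prefix words from the answer (do, get, set, in) plus 'on' from the question's examples.
# The prefix is matched case-insensitively; the next letter must be a capital.
TYPICAL_PREFIX = re.compile(r"^(?i:do|get|set|in|on)(?=[A-Z])")

def has_typical_prefix(name):
    """True if the name starts with a typical verb-like word followed by a capitalized word."""
    return TYPICAL_PREFIX.match(name) is not None

for name in ["DoPaint", "onDelete", "Index", "Z090292"]:
    print(name, has_typical_prefix(name))
```

The lookahead keeps ordinary words such as Index or Dog from matching just because they happen to begin with a listed prefix.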

Imagist · Answer 6 · 2009-08-29T22:24:37+0000

Using a dictionary, as suggested by Martin v. Löwis, is a good option, but you should also keep in mind the following common forms of variable names:

Single-letter variable names.
Variable names that use underscores instead of camel case.
Metasyntactic variables.
Hungarian notation.
Keywords / types with an attached symbol (e.g. $return or list_).
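The forms listed above could be detected with a few simple checks, sketched below. The lists of metasyntactic names and Hungarian prefixes are illustrative assumptions rather than complete sets, and `naming_forms` is a made-up helper name.

```python
import re

# Illustrative, incomplete lists (assumptions for this sketch).
METASYNTACTIC = {"foo", "bar", "baz", "qux", "quux"}
HUNGARIAN_PREFIX = re.compile(r"^(?:i|n|b|c|f|h|p|sz|lp|dw|fn)(?=[A-Z])")
ATTACHED_SYMBOL = re.compile(r"^[^A-Za-z0-9]|[^A-Za-z0-9]$")
INNER_UNDERSCORE = re.compile(r"[A-Za-z0-9]_[A-Za-z0-9]")

def naming_forms(name):
    """Return the labels of every listed form that the name appears to match."""
    forms = []
    if len(name) == 1 and name.isalpha():
        forms.append("single-letter")
    if INNER_UNDERSCORE.search(name):
        forms.append("underscores")
    if name.lower() in METASYNTACTIC:
        forms.append("metasyntactic")
    if HUNGARIAN_PREFIX.match(name):
        forms.append("hungarian")
    if ATTACHED_SYMBOL.search(name):
        forms.append("attached-symbol")
    return forms

for name in ["i", "close_window", "foo", "iGetIndex", "$return", "list_", "CloseWindow"]:
    print(name, naming_forms(name))
```

An empty result only means none of these particular forms was recognized; it says nothing by itself about whether the name is human-written.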
