How to predict whether a function name matches a human naming convention

Let's say you have a repository of 10,000 function names, possibly with their frequency of use, taken from a body of code that might be in C, C#, or C++ (which usually have different conventions).

Some examples might be:

DoPaint
OnPaint
CloseWindow
DeleteGraphOnClose
FreeConnection
ConnectInternat (smallTypo, but part of code)
FreeSoH 

      

Now, given a function name, how can we predict whether it follows a human naming convention?

Note:

  • Obviously, all candidate names will be valid identifiers.
  • Generated names can contain arbitrary characters and should be considered bad.
  • Letter case can be garbled.

Some candidates:

Z090292 - not likely
onDelete - likely
CloseWindow - likely
iGetIndex - unlikely

      

Any pointers on technology and software are welcome

+2




6 answers


You can try to do some kind of Bayesian analysis on the text:

  • Load the list of names (and their frequencies) into your program. For now, this may mean tokenizing the names. So, for example, CloseWindow becomes Close and Window, and the frequency of both tokens increases. At this point, it would also help to load some non-human function names, so that the program learns the counterexamples as well.
  • Take the name of the function and, using the data you just collected, find the probability of each part appearing:

    P(HumanGenerated | seeing token) = P(seeing token | HumanGenerated) * P(HumanGenerated) / P(seeing token)

In this case, the prior probability that a name is generated by a human rather than a computer, P(HumanGenerated), is determined from prior knowledge, that is, what percentage of function names are considered human-generated.

The probability of seeing the token, P(seeing token), should evolve gradually. It is made up of how many times the token appears in human-written function names and how many times it appears in computer-generated ones. This approach rests on the premise that the program learns over time (and therefore needs to be trained).

The result, P(HumanGenerated | seeing token), gives you the likelihood that the function name is human-generated.

NB: This is just a rough outline; many details are missing. If you are interested in this line of research, I would suggest reading up on probability theory, and Bayesian analysis in particular.
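The outline above can be sketched roughly as follows. This is only an illustration under several assumptions that are not in the answer: tokens are combined with the usual naive Bayes independence assumption, unseen tokens get add-one smoothing, and the class name NameClassifier, the helper split_identifier, and the machine-generated training names are all made up for the example.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a camel-case identifier into lowercase tokens (CloseWindow -> close, window)."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [token.lower() for token in re.findall(pattern, name)]


class NameClassifier:
    """Hypothetical naive Bayes classifier over identifier tokens."""

    def __init__(self):
        self.token_counts = {"human": Counter(), "generated": Counter()}
        self.name_counts = Counter()

    def train(self, name, label, frequency=1):
        # Each token of the name is counted once per use of the name.
        self.name_counts[label] += frequency
        for token in split_identifier(name):
            self.token_counts[label][token] += frequency

    def prob_human(self, name):
        # Assumes at least one training name exists for each label.
        vocabulary = set(self.token_counts["human"]) | set(self.token_counts["generated"])
        total_names = sum(self.name_counts.values())
        log_scores = {}
        for label, counts in self.token_counts.items():
            # log P(label): the prior taken from the share of training names.
            score = math.log(self.name_counts[label] / total_names)
            total_tokens = sum(counts.values())
            for token in split_identifier(name):
                # log P(token | label) with add-one smoothing (an assumption).
                score += math.log((counts[token] + 1) / (total_tokens + len(vocabulary) + 1))
            log_scores[label] = score
        # Normalising by the sum over both labels plays the role of P(seeing token).
        highest = max(log_scores.values())
        weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
        return weights["human"] / sum(weights.values())


classifier = NameClassifier()
for human_name in ["DoPaint", "OnPaint", "CloseWindow", "DeleteGraphOnClose", "FreeConnection"]:
    classifier.train(human_name, "human")
for generated_name in ["Z090292", "X17QQ3", "A00012"]:  # invented examples
    classifier.train(generated_name, "generated")

print(classifier.prob_human("CloseWindow"))
print(classifier.prob_human("Z090292"))
```

With this tiny training set the first score comes out above 0.5 and the second below it; a real corpus would need far more examples of both kinds.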

+2




Split the identifiers into individual words (based on capitalization) and run the words through a spell checker (like ispell). Treat all misspelled words as non-human, as well as the identifiers in which they occur.
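A minimal sketch of this idea, assuming a small hard-coded word set in place of a real spell checker. KNOWN_WORDS, split_words, and looks_human are hypothetical names; in practice the dictionary would come from ispell or a similar word list.

```python
import re

# Hypothetical stand-in for a real spell-checker dictionary.
KNOWN_WORDS = {
    "do", "on", "paint", "close", "window", "delete",
    "graph", "free", "connection", "get", "index",
}


def split_words(identifier):
    """Split an identifier on capitalization and digit runs, lowercasing each part."""
    return [word.lower() for word in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", identifier)]


def looks_human(identifier, dictionary=KNOWN_WORDS):
    """Return True only if every word of the identifier is in the dictionary."""
    words = split_words(identifier)
    return bool(words) and all(word in dictionary for word in words)


for candidate in ["CloseWindow", "onDelete", "Z090292", "ConnectInternat", "iGetIndex"]:
    print(candidate, looks_human(candidate))
```

Note that a strict check like this also rejects names such as ConnectInternat, which contain a typo but are part of real code, so some tolerance for near misses may be worth adding.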



+1




A friend of mine can help. As far as I can tell, he is doing his Ph.D. thesis on this topic.

Homepage

+1




Predicting whether a name is human-generated is a very difficult problem. Analyzing your codebase to find function names is easier; you can look at tools like NDepend.

0




You can probably find gum. Alternatively, you can run regular expression searches for typical words like do, get, set, in, etc. that come before the next capitalized word.
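As a rough illustration of the regular expression idea, the sketch below checks whether a name starts with one of the typical words followed by a capitalized word. The pattern and the helper name has_common_prefix are assumptions made for this example, and the word list would need extending for a real codebase.

```python
import re

# Typical leading words from the answer, in either case, followed by a
# capitalized word such as "Paint" or "Index".
PREFIX_PATTERN = re.compile(r"^(?:[Dd]o|[Gg]et|[Ss]et|[Ii]n)(?=[A-Z][a-z])")


def has_common_prefix(name):
    """Return True if the name starts with a typical word and then a capitalized word."""
    return PREFIX_PATTERN.match(name) is not None


for candidate in ["DoPaint", "getIndex", "Index", "Z090292"]:
    print(candidate, has_common_prefix(candidate))
```

The lookahead keeps words like Index from matching on their first two letters alone.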

0




Using a dictionary, as Martin W. Lowes suggested, is a good option, but you should also keep in mind the following common forms of variable names:

  • Single letter variable names.
  • Variable names that use underscores instead of the camel case.
  • Metasyntactic variables
  • Hungarian notation.
  • Keywords / types with an attached symbol (e.g. $return or list_).
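The forms listed above could be detected with a few simple checks before any dictionary lookup. This is only a sketch: the word sets, the Hungarian prefix list, and the function name classify_form are all assumptions chosen for illustration, not an exhaustive rule set.

```python
import re

# Small illustrative sets; real ones would be much larger.
METASYNTACTIC = {"foo", "bar", "baz", "qux"}
KEYWORDS_AND_TYPES = {"return", "list", "int", "class"}
# A few common Hungarian-style prefixes followed by a capital letter.
HUNGARIAN = re.compile(r"^(?:sz|lp|dw|[nibp])[A-Z]")


def classify_form(name):
    """Return a label for a recognised special form, or None if nothing matches."""
    if len(name) == 1 and name.isalpha():
        return "single letter"
    if name.lower() in METASYNTACTIC:
        return "metasyntactic"
    stripped = name.strip("$_")
    if stripped != name and stripped in KEYWORDS_AND_TYPES:
        return "keyword with attached symbol"
    if "_" in name:
        return "underscores"
    if HUNGARIAN.match(name):
        return "hungarian"
    return None


for candidate in ["i", "foo", "$return", "list_", "close_window", "iGetIndex", "CloseWindow"]:
    print(candidate, classify_form(candidate))
```

The order of the checks matters: list_ contains an underscore, so the keyword check has to run before the underscore check.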
0

