Removing strings from C source code

Can anyone point me to a program that removes string literals from C source code? For example,

#include <stdio.h>
static const char *place = "world";
char * multiline_str = "one \
two \
three\n";
int main(int argc, char *argv[])
{
        printf("Hello %s\n", place);
        printf("The previous line says \"Hello %s\"\n", place);
        return 0;
}

      

becomes

#include <stdio.h>
static const char *place = ;
char * multiline_str = ;
int main(int argc, char *argv[])
{
        printf(, place);
        printf(, place);
        return 0;
}

      

I am looking for a program very similar to stripcmt, except that I want to strip strings, not comments.

The reason I'm looking for an existing program and not just a handy regex is that once you start considering all the corner cases (quotes inside strings, multi-line strings, etc.), things usually get (much) more complicated than they seem at first glance. There are also limits to what regular expressions can do, and I suspect they are not sufficient for this task. If you think you have an extremely robust regex, feel free to submit it, but please, no naive sed 's/"[^"]*"//g'-like suggestions.

(There is no need for special handling of (possibly unterminated) strings in comments; comments will be removed first.)

Support for multi-line strings with embedded newlines is not important (they are not legal C), but strings that span multiple lines by ending each line with \ should be supported.

This is pretty much the same as several other questions, but I haven't found any references to actual tools.

+2




4 answers


You can download the source code for StripCmt (.tar.gz - 5kB). It is trivially small and shouldn't be too hard to adapt to strip strings instead (it is released under the GPL).

You can also look into the official lexical rules for C strings. I found the following very quickly, but it may not be definitive. It defines a string as:



stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.
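To see why that definition may not be enough on its own, here is a small sketch in Python that translates the `stringcon` rule literally into a pattern. The names `NAIVE_STRING` and `strip_naive` are made up for this illustration, and `isprint()` is only approximated by "anything except a double quote or newline". Because the rule says nothing about backslash escapes, the pattern stops at the first escaped quote in the second `printf` line from the question.

```python
import re

# Literal translation of the stringcon rule: a double quote, any characters
# other than a double quote or newline, then a closing double quote.
# Assumption: isprint() is approximated by "not a quote, not a newline".
NAIVE_STRING = re.compile(r'"[^"\n]*"')

def strip_naive(line):
    """Remove everything the naive pattern considers a string literal."""
    return NAIVE_STRING.sub('', line)

# The two printf lines from the question, written as Python strings.
simple = 'printf("Hello %s\\n", place);'
escaped = 'printf("The previous line says \\"Hello %s\\"\\n", place);'

print(strip_naive(simple))   # the string is removed cleanly
print(strip_naive(escaped))  # the escaped quotes split the string in the wrong places
```

The first line comes out as intended, while the second leaves part of the string's contents behind, which is the kind of corner case the question warns about.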

      

+4




All tokens in C (and most other programming languages) are "regular". That is, they can be matched with a regular expression.

Regular expression for C strings:

"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"

      

The regex is not too hard to understand. Basically, a string literal is a pair of double quotes surrounding a sequence of:

  • non-special characters (non-quote / backslash / newline)
  • escapes that start with a backslash and then consist of one of:
    • simple escape character
    • 1 to 3 octal digits
    • x and 1 or more hexadecimal digits

This is based on sections 6.1.4 and 6.1.3.4 of the C89 / C90 spec. If something else has crept into C99, it won't catch it, but it shouldn't be hard to fix.
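As a reading aid, the same pattern can be laid out with Python's `re.VERBOSE` flag so that each alternative from the list above sits on its own commented line. This is only a re-layout of the regex shown earlier (using non-capturing groups, which does not change what it matches); the names `C_STRING` and `strip_strings` are invented for this sketch.

```python
import re

# Verbose layout of the C89 string-literal pattern. Whitespace and comments
# inside the pattern are ignored by the regex engine in this mode.
C_STRING = re.compile(r'''
    "                         # opening double quote
    (?:
        [^"\\\n]              # ordinary character: not quote, backslash, newline
      | \\ (?:                # or a backslash followed by one of:
            ['"?\\abfnrtv]    #   a simple escape character
          | [0-7]{1,3}        #   1 to 3 octal digits
          | x[0-9a-fA-F]+     #   x and 1 or more hexadecimal digits
        )
    )*
    "                         # closing double quote
''', re.VERBOSE)

def strip_strings(line):
    """Delete every complete string literal found in one line of C."""
    return C_STRING.sub('', line)

print(strip_strings('static const char *place = "world";'))
print(strip_strings('printf("The previous line says \\"Hello %s\\"\\n", place);'))
```

Both sample lines from the question lose their string literals in full, including the one with escaped quotes.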



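Here's a Python script that filters a C source file, removing string literals:

import re, sys

# Matches one complete C string literal, including escape sequences.
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')

# Copy stdin to stdout, deleting every string literal on each line.
for line in sys.stdin:
  print(regex.sub('', line.rstrip('\n')))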

      

EDIT:

It occurred to me after I posted the above that, while it is true that all C tokens are regular, by not tokenizing everything we leave room for trouble. In particular, if a double quote appears inside what should be another token, we can be led down the garden path. You mentioned that comments have already been stripped, so the only thing we really need to worry about is character literals (although the approach I'm using can easily be extended to handle comments as well). Here's a more robust script that handles character literals:

import re, sys

# String literals and character literals share the same escape syntax.
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
  # Keep character literals unchanged; replace string literals with nothing.
  m = m.group(0)
  if m.startswith("'"):
    return m
  else:
    return ''

for line in sys.stdin:
  print(regex.sub(repl, line.rstrip('\n')))

      

Basically, we find both string and character literal tokens, then leave the character literals alone and strip out the string literals. The character literal regex is very similar to the string literal one.
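The question also asks for strings continued across lines with a trailing \, which a line-by-line filter does not cover. One possible way to handle that, sketched below, is to mimic what a C compiler does before tokenizing: delete every backslash that is immediately followed by a newline, then apply the combined pattern to the whole text. This is an assumption-laden sketch rather than a tested tool: the helper names `splice_lines` and `strip_strings` are invented here, only `\n` line endings are considered, and joining continued lines changes the line numbering of the output.

```python
import re

# Same two patterns as above, written with non-capturing groups.
STR_RE = r'''"(?:[^"\\\n]|\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
CHR_RE = r"""'(?:[^'\\\n]|\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""
TOKEN = re.compile('|'.join([STR_RE, CHR_RE]))

def splice_lines(source):
    """Delete each backslash that is immediately followed by a newline."""
    return source.replace('\\\n', '')

def strip_strings(source):
    """Splice continued lines, then drop string literals but keep char literals."""
    def repl(match):
        text = match.group(0)
        return text if text.startswith("'") else ''
    return TOKEN.sub(repl, splice_lines(source))

# The multi-line string from the question, plus a char literal holding a quote.
example = 'char * multiline_str = "one \\\ntwo \\\nthree\\n";\nchar q = \'"\';\n'
print(strip_strings(example), end='')
```

On this input the three-line string collapses to `char * multiline_str = ;`, matching the expected output in the question, and the character literal is left untouched.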

+5




In ruby:

#!/usr/bin/ruby
f=open(ARGV[0],"r")
s=f.read
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/,""))
f.close

      

This prints the result to standard output.

0




In Python using pyparsing:

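import sys
from pyparsing import dblQuotedString

# Read the C file named on the command line.
source = open(sys.argv[1]).read()
# Replace every double-quoted string with an empty string.
dblQuotedString.setParseAction(lambda: "")
print(dblQuotedString.transformString(source))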

      

Also prints to stdout.

0








