Count lines in an ASCII file using C

I would like to count the number of lines in an ASCII text file. I thought the best way to do this is to count the newlines in the file:

for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) {  /* Count line endings. */
    if (c == '\n') ++lines;
}

      

However, I'm not sure whether this accounts for the last line on both Windows and Linux. That is, if my text file ends like the one below, without an explicit newline, is there an implicit one encoded there anyway, or do I need to add an extra ++lines; after the for loop?

cat
dog

      

Then what if there is an explicit newline at the end of the file? Or do I just need to check this case by keeping track of the previously read value?

+3




7 replies


If there is no newline, none will be generated. C tells you exactly what is in the file.



+3




Text files should always end with a newline. There is no canonical way to handle files that don't.

Here's how some tools choose to deal with characters after the last line feed:



  • wc doesn't count it as a line (so you have good precedent for this)
  • Vim marks the file as [noeol] and saves the file without a trailing line feed
  • GNU sed treats the file as if it had a final line feed
  • sh's read exits with an error but still returns the data

Since the behavior is largely undefined, you can just do whatever is convenient or useful to you.

+3




First, there will be no implicitly encoded newline at the end of the last line. The only way there will be a newline is if the software or the person who created the file put it there. Putting it there is, however, generally considered good practice.

The final answer to what you should report as the line count depends on the convention expected by the software or people who will use that count, and probably also on what you can assume about the behavior of the input source.

Most command line tools terminate their output with a newline character. In this case, a reasonable answer might be to report the number of newlines as the number of actual lines.

On the other hand, when a text editor displays the file, the line numbering in the margin (if supported) includes the last line, empty or not. This tells the user that there is a blank line there, so if you want to count the number of lines shown in the margin, that is one plus the number of newlines in the file. It is common for some coders not to end their last line with a newline character (sometimes out of negligence), so one plus the number of newlines would actually be the correct answer in that case.

I'm not sure any other conventions make sense. For example, if you choose not to count the last line when it is empty, what counts as empty? That the file ends right after a newline? What if there is whitespace on that line? What if there are several blank lines at the end of the file?

+3




If you are going to use this method, you can keep a separate counter for the number of characters on the line you are currently on. If that counter is greater than 0 at the end, you know there was data on the last line that was not terminated by a newline.

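int letters = 0;  /* characters seen since the last newline */
int lines = 0;

for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) {
    letters++;  /* count every character, including the newline itself */

    if (c == '\n')
    {
        ++lines;
        letters = 0;  /* reset at the start of each new line */
    }
}

if (letters > 0)  /* data after the last newline: count the unterminated line */
{
    ++lines;
}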

+2




Your concern is real: the last line in the file may be missing its end-of-line marker. The end-of-line marker is a single '\n' on Linux and a CR LF pair on Windows, which the C runtime automatically converts to '\n' for streams opened in text mode.
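As a side note on the CR LF point, here is a minimal sketch (not part of the original answer) suggesting why counting '\n' should give the same line count even if the file is opened in binary mode, where no conversion happens: every CR LF pair still contains exactly one '\n'. The names eol_counts and count_eols are hypothetical, and buf is assumed to hold raw bytes read from a stream opened with "rb".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many '\n' bytes were seen, and how many
 * of those were immediately preceded by '\r' (Windows-style endings). */
struct eol_counts {
    size_t lf;
    size_t crlf;
};

/* Hypothetical helper: scans raw, unconverted bytes. */
static struct eol_counts count_eols(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    struct eol_counts c = {0, 0};

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            c.lf++;                          /* one per line ending, LF or CR LF */
            if (i > 0 && buf[i - 1] == '\r')
                c.crlf++;                    /* this ending was a CR LF pair */
        }
    }
    return c;
}
```

With the bytes "cat\r\ndog\r\n" both counters end up at 2; with "cat\ndog" the lf counter is 1 and crlf is 0. Either way, lf is the number of terminated lines.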

You can simplify your code and handle the special case of the last line lacking a newline:

int c, last = '\n', lines = 0;

while ((c = getc(fp)) != EOF) {  /* Count line endings. */
    if (c == '\n')
        lines += 1;
    last = c;
}
if (last != '\n')
    lines += 1;

      

Since you're concerned about speed, using getc instead of fgetc will help on platforms where it is defined as a macro that accesses the stream structure directly and only calls a function to refill the buffer, every BUFSIZ characters or so, unless the stream is unbuffered.
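If per-character calls still matter, one alternative (a sketch, not something the answer above proposes) is to read the stream in blocks with fread and search each block with memchr. The function name count_lines_chunked is hypothetical, the stream is assumed to be already open, and whether this is actually faster than getc on a given platform is something to measure rather than assume.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts lines in an open stream, treating unterminated
 * data after the last newline as one more line (same rule as the loop above). */
static long count_lines_chunked(FILE *fp)
{
    assert(fp != NULL);
    char buf[BUFSIZ];
    size_t n;
    long lines = 0;
    char last = '\n';   /* so that an empty file yields 0 lines */

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        const char *p = buf;
        const char *end = buf + n;

        while (p < end) {
            const char *nl = memchr(p, '\n', (size_t)(end - p));
            if (nl == NULL)
                break;          /* no more newlines in this block */
            lines++;
            p = nl + 1;         /* continue searching after the newline */
        }
        last = buf[n - 1];      /* remember the final byte read so far */
    }
    if (last != '\n')
        lines++;                /* unterminated last line */
    return lines;
}
```

For a stream containing "cat\ndog" this returns 2, and it also returns 2 for "cat\ndog\n".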

+2




How about this:

Create a flag that keeps track of whether any characters have followed the most recent \n; it is reset when c == '\n'. After EOF, check whether the flag is true and increment the line count if so.

bool more_chars = false;  /* requires <stdbool.h> */
int lines = 0;
for (int c = fgetc(fp); c != EOF; c = fgetc(fp)) {
    if (c == '\n') {
        more_chars = false;
        ++lines;
    } else {
        more_chars = true;
    }
}
if (more_chars)
    ++lines;  /* unterminated last line */

+1




Windows and UNIX / Linux style line breaks don't make any difference here. On any system, a text file may or may not have a newline at the end of the last line.

If you always add 1 to the line count, this effectively counts the blank line at the end of the file when there is a newline at the end (i.e. the file "foo\n" will be considered to have two lines: "foo" and ""). This might be a perfectly reasonable solution, depending on how you want to define a line.

Another definition of "line" is that it always ends with a newline character, i.e. the file "foo\nbar" will have only one line ("foo") by this definition. This is the definition wc uses.

Of course, you can keep track of whether a newline was the last character in the file, and only add 1 to the count if it wasn't. Then a "line" is defined as either ending with a newline or being non-empty at the end of the file, which sounds rather complicated to me.
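To compare the three definitions side by side, here is a minimal sketch that applies each of them to a buffer already held in memory. The enum line_rule, its constants, and count_lines_mem are hypothetical names chosen for illustration, not an existing API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three conventions discussed above. */
enum line_rule {
    NEWLINES_PLUS_ONE,   /* always one more than the number of newlines */
    NEWLINES_ONLY,       /* wc-style: a line must end with '\n' */
    COUNT_PARTIAL_LAST   /* newlines, plus one if data follows the last newline */
};

static size_t count_lines_mem(const char *buf, size_t len, enum line_rule rule)
{
    assert(buf != NULL || len == 0);
    size_t newlines = 0;

    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            newlines++;

    switch (rule) {
    case NEWLINES_PLUS_ONE:
        return newlines + 1;
    case NEWLINES_ONLY:
        return newlines;
    case COUNT_PARTIAL_LAST:
        /* add one only when the buffer is non-empty and unterminated */
        return newlines + (len > 0 && buf[len - 1] != '\n');
    }
    return newlines;  /* not reached for valid rule values */
}
```

For "foo\n" the three rules give 2, 1 and 1 respectively; for "foo\nbar" they give 2, 1 and 2.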

-1

