Parsing a file containing non-printable ASCII characters

Question

I have a file (probably binary) that contains mostly non-printable ASCII characters, as the output of the octal dump utility shows.

od  -a MyFile.log 
0000000  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
0000020 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
0000040 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0000100 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
0000160 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000200 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
0000220 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
0000240 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
0000260 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx
0000300 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul

I would like to do the following:

Parse the file, or split it into paragraph-like sections, each starting with any of the characters esc, fs, gs or us (ASCII codes 27, 28, 29 and 31).
Produce output that consists of human-readable ASCII characters, like the octal dump above.
Save the result to a file.

What would be the best way to do this? I would rather use UNIX/Linux shell utilities, e.g. grep, for this task than write a C program.

Thanks.

Edit: I used the octal dump command od -A n -a -v MyFile.log to remove the offsets from the output, as follows:

  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx

I would like to pipe this output into some other utility, e.g. awk.

+3

linux bash shell parsing ascii

Olumide 06 Mar At 13:54

5 answers

If you have access to awk that supports regular expressions in RS (like gawk), you can do:

awk 'BEGIN { ORS = ""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" }
    { print | cmd; close(cmd) }' MyFile.log > output

This will dump all of the output into one file. If you want each "paragraph" to be in a different output file, you can do:

awk 'BEGIN { ORS = ""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" }
    { print | (cmd " > output" NR) }' MyFile.log

to write files output1, output2, etc.

Note that the awk standard states that the behavior is unspecified if RS contains more than one character, but many awk implementations will support regular expressions like this.
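
For readers without a regex-capable awk, the record splitting that the RS regular expression performs above can be sketched in Python. This is only an illustration, not part of the answer: the names SEPARATORS and split_records are made up here, and the sample bytes are taken from the start of the dump in the question. Unlike the awk command it does not call od; it only shows where the records are cut.

```python
import re

# Separator bytes from the question: esc (27), fs (28), gs (29), us (31).
# Hypothetical name, chosen for this sketch.
SEPARATORS = re.compile(rb"[\x1b\x1c\x1d\x1f]")

def split_records(data):
    """Split data at each separator byte; the separators themselves are dropped."""
    return SEPARATORS.split(data)

if __name__ == "__main__":
    # First bytes of the file shown in the question: cr nl esc a soh nul esc * soh L soh nul
    sample = b"\r\n\x1ba\x01\x00\x1b*\x01L\x01\x00"
    for number, record in enumerate(split_records(sample), start=1):
        print(number, record)
```

As with the awk version, the separator byte is consumed by the split. Adjacent separators, or a separator at the very start or end of the data, produce empty records in this sketch.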

+2

William Pursell 06 Mar At 14:11

I think it would be easier to write a flex program:

/*
 * This file is part of flex.
 * 
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 
 * Neither the name of the University nor the names of its contributors
 * may be used to endorse or promote products derived from this software
 * without specific prior written permission.
 * 
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE.
 */

    /************************************************** 
        start of definitions section

    ***************************************************/

%{
/* A template scanner file to build "scanner.c". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
/*#include "parser.h" */

//put your variables here
char FileName[256];
FILE *outfile;
char inputName[256];


// flags for command line options
static int output_flag = 0;
static int help_flag = 0;

%}


%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section

    *************************************************/


    /* these flex patterns will eat all input */ 
\x1B { fprintf(yyout, "\n\n"); }
\x1C { fprintf(yyout, "\n\n"); }
\x1D { fprintf(yyout, "\n\n"); }
\x1F { fprintf(yyout, "\n\n"); }
[[:alnum:]] { ECHO; }
.  { }
\n { ECHO; }


%%
    /**************************************************** 
        start of code section


    *****************************************************/

int main(int argc, char **argv);

int main (argc,argv)
int argc;
char **argv;
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */

            {"useStdOut", no_argument,       0, 'o'},
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "ho",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case 'o':
               output_flag = 1;
               break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: cleaner [OPTIONS]... INFILE OUTFILE\n");
        printf("Strips non printable chars from input, adds line breaks on esc fs gs and us\n\n");
        printf("Option list: \n");
        printf("-o                      sets output to stdout\n");
        printf("--help                  print help to screen\n");
        printf("\n");
        printf("If infile is left out, then stdin is used for input.\n");
        printf("If outfile is a filename, then that file is used.\n");
        printf("If there is no outfile, then infile-EDIT is used.\n");
        printf("There cannot be an outfile without an infile.\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin


    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "rb");
        if (!file) {
            fprintf(stderr, "Flex could not open %s\n",argv[optind]);
            exit(1);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        printf("no input file set, using stdin. Press ctrl-c to quit");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //increment current place in argument list
    optind++;


    /********************************************
        if no input name, then output set to stdout
        if no output name then copy input name and add -EDIT.csv
        otherwise use output name

    *********************************************/
    if (optind > argc) {
        yyout = stdout;
    }   
    else if (output_flag == 1) {
        yyout = stdout;
    }
    else if (optind < argc){
        outfile = fopen(argv[optind], "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",argv[optind]);
                exit(1);
            }
        yyout = outfile;
    }
    else {
        strncpy(FileName, argv[optind-1], strlen(argv[optind-1])-4);
        FileName[strlen(argv[optind-1])-4] = '\0';
        strcat(FileName, "-EDIT");
        outfile = fopen(FileName, "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",FileName);
                exit(1);
            }
        yyout = outfile;
    }

    yylex();
    if (output_flag == 0) {
        fclose(yyout);
    }
    printf("Flex program finished running file %s\n", inputName);
    return 0;
}

To compile for Windows or Linux, use a Linux box with flex and mingw. Then run this makefile in the same directory as the previous file, scanner.l.

TARGET = cleaner.exe
TESTBUILD = cleaner
LEX = flex
LFLAGS = -Cf
CC = i586-mingw32msvc-gcc
CFLAGS = -O -Wall 
INSTALLDIR = 

.PHONY: default all clean install uninstall cleanall

default: $(TARGET)

all: default install

OBJECTS = $(patsubst %.l, %.c, $(wildcard *.l))

%.c: %.l
    $(LEX) $(LFLAGS) -o $@ $<

.PRECIOUS: $(TARGET) $(OBJECTS)

$(TARGET): $(OBJECTS)
    $(CC) $(OBJECTS) $(CFLAGS) -o $@

linux: $(OBJECTS)
    gcc $(OBJECTS) $(CFLAGS) -o $(TESTBUILD)

cleanall: clean uninstall

clean:
    -rm -f *.c
    -rm -f $(TARGET)
    -rm -f $(TESTBUILD)

uninstall:
    -rm -f $(INSTALLDIR)/$(TARGET)

install:
    cp -f $(TARGET) $(INSTALLDIR)

Once compiled and placed in your path, just use it with od -A n -a -v MyFile.log | cleaner.
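
As a rough cross-check of what the scanner rules are meant to do, here is a small Python sketch of the same filtering. It assumes the rules are intended as: emit a blank line on esc, fs, gs and us, pass through ASCII letters, digits and newlines, and drop every other byte. The names SEPARATORS, KEPT and clean are invented for this sketch, which ignores the command-line handling entirely.

```python
import string

# esc, fs, gs, us: each becomes a paragraph break, as in the first four rules.
SEPARATORS = {0x1B, 0x1C, 0x1D, 0x1F}
# Characters that are echoed unchanged: ASCII alphanumerics and newline.
KEPT = set(string.ascii_letters + string.digits + "\n")

def clean(data):
    """Return the text the scanner rules would write for the given input bytes."""
    out = []
    for byte in data:
        if byte in SEPARATORS:
            out.append("\n\n")
        elif chr(byte) in KEPT:
            out.append(chr(byte))
        # every other byte is silently dropped, like the catch-all rule
    return "".join(out)

if __name__ == "__main__":
    print(repr(clean(b"\r\n\x1ba\x01\x00\x1b*\x01L\x01\x00")))
```

Running the sketch on raw file bytes and on od output gives different results, so it is worth checking on a small sample which of the two inputs you actually want to feed to the scanner.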

+1

Spencer Rathbun 06 Mar 12 at 14:59

I wrote a simple main.c program

#include <stdio.h>

char *human_ch[]=
{
"NILL",
"EOL"
};
char code_buf[3];

// you can implement whatever you want for conversion to human-readable format
const char *human_readable(int ch_code)
{
    switch(ch_code)
    {
    case 0:
        return human_ch[0];
    case '\n':
        return human_ch[1];
    default:
        sprintf(code_buf,"%02x", (0xFF&ch_code) );
        return code_buf;
    }
}

int main( int argc, char **argv)
{
    int ch=0;
    FILE *ofile;
    if (argc<2)
        return -1;

    ofile=fopen(argv[1],"w+");
    if (!ofile)
        return -1;

    while( EOF!=(ch=fgetc(stdin)))
    {

        fprintf(ofile,"%s",human_readable(ch));
        switch(ch)
        {
            case 27:
            case 28:
            case 29:
            case 31:
                fputc('\n',ofile); //paragraph separator
                break;
            default:
                fputc(' ',ofile); //characters separator
                break;
        }
    }

    fclose(ofile);
    return 0;
}

The program reads stdin byte by byte and uses the function human_readable() to convert each byte to a user-specified value. In my example I have implemented only the values EOL and NILL; in all other cases the program writes the hexadecimal code of the character to the output file.

Compile: gcc main.c

Usage: ./a.out outfile <infile
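
If the goal is output that looks like od -a rather than hex codes, the conversion function can be filled in with the usual ASCII control-character names. The following Python sketch shows one way to do that. The table NAMES and the function names are assumptions made for this illustration, and bytes above 127 are written in hex here, which differs from od -a.

```python
# Names for byte values 0..32, in the style printed by od -a.
NAMES = ("nul soh stx etx eot enq ack bel bs ht nl vt ff cr so si "
         "dle dc1 dc2 dc3 dc4 nak syn etb can em sub esc fs gs rs us sp").split()
SEPARATORS = {27, 28, 29, 31}  # esc, fs, gs, us

def human_readable(byte):
    """Map one byte value to a printable token."""
    if byte < len(NAMES):      # control characters and space
        return NAMES[byte]
    if byte == 127:
        return "del"
    if byte < 127:             # printable ASCII is shown as itself
        return chr(byte)
    return "%02x" % byte       # assumption: non-ASCII bytes as hex

def render(data):
    """Write each token followed by a space, or by a newline after a separator."""
    parts = []
    for byte in data:
        parts.append(human_readable(byte))
        parts.append("\n" if byte in SEPARATORS else " ")
    return "".join(parts)

if __name__ == "__main__":
    print(render(b"\r\n\x1ba\x01\x00"))
```

Like the C program, this ends a line after each separator, so the separator appears at the end of a section rather than at its start.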

+1

2r2w 06 Mar 12 at 15:23

Here's a little Python program that does what you want (at least the splitting part):

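#!/usr/bin/env python3

import os
import sys

def main():
    if len(sys.argv) < 3:
        return

    name = sys.argv[1]
    # separator characters, recovered from the argument as raw bytes
    codes = os.fsencode(sys.argv[2])

    # output name pattern: <name>.out.0001, <name>.out.0002, ...
    p = '%s.out.%%.4d' % name
    i = 1

    # binary mode, because the input is not text
    fIn = open(name, 'rb')
    fOut = open(p % i, 'wb')

    c = fIn.read(1)
    while c != b'':
        fOut.write(c)
        c = fIn.read(1)

        # start a new output file when the next byte is a separator
        if c != b'' and codes.find(c) != -1:
            fOut.close()
            i = i + 1
            fOut = open(p % i, 'wb')

    fOut.close()
    fIn.close()

if __name__ == '__main__':
    main()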

Usage:

python split.py file codes

For example, on the bash command line:

python split.py input.txt $'\x1B'$'\x1C'

This produces the files input.txt.out.0001, input.txt.out.0002, ... by splitting input.txt at any of the given codes (in this example 27 and 28).

You can then iterate over those files and convert them to a printable format by passing them through od:

for f in input.txt.out.*; do od "$f" > "$f.od"; done

0

Manish 06 Mar 12 at 16:09

ninjalj · Accepted Answer · 2012-03-24T23:36:59+0000

od -a -An -v file | perl -0777ne 's/\n//g,print "$_\n " for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs'

od -a -An -v file → octal dump of the file with named characters (-a), no addresses (-An) and no suppression of duplicate lines (-v).
-0777 → slurp the whole file (the record separator is the nonexistent character 0777).
-n → use an implicit loop to read the input (here the whole file, as a single record).
for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs → for each section (/g) that optionally begins with esc, fs, gs or us, and contains a maximal sequence of characters (including newlines: /s) with no esc, fs, gs or us in it.
s/\n//g → remove the newlines inserted by od.
print "$_\n " → print the section followed by a newline (and a space, to match od's formatting).
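
To see how that regular expression carves up the od text, the same pattern can be tried in Python. This is only a sketch for experimenting: the names SECTION and sections are invented here, the sample string imitates od -a -An output, and, unlike the Perl one-liner, empty matches are filtered out.

```python
import re

# Same pattern as in the Perl one-liner; re.DOTALL plays the role of /s.
SECTION = re.compile(r"(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*", re.DOTALL)

def sections(dump):
    """Return the sections of an od -a dump, with od's line breaks removed."""
    found = []
    for match in SECTION.finditer(dump):
        text = match.group(0)
        if text:  # skip the empty match at the end of the input
            found.append(text.replace("\n", ""))
    return found

if __name__ == "__main__":
    dump = "  cr  nl esc   a soh nul esc   * soh   L soh nul\n nul soh etx\n"
    for section in sections(dump):
        print(section)
```

Each section after the first starts with the separator name, and a section that spans several od lines is joined into one line.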
