Reading a file in Java with escape characters for newlines

I have a Unicode file that needs to be loaded into a database (Vertica). The column separator is CTRL + B and the record separator is the newline character (\n). Whenever a column value contains a newline, that newline is escaped with CTRL + A.

When I use BufferedReader.readLine() to read this file, the records with IDs 2 and 4 are each read as two records, whereas I want each of them read as a single record, as shown in the output.

Here is a sample input file. | stands for CTRL + B and ^ stands for CTRL + A.

Input
ID|Name|Job Desc
----------------
1|xxxx|SO Job
2|YYYY|SO Careers^
Job
3|RRRRR|SO
4|ZZZZ^
 ZZ|SO Job
5|AAAA|YU

Output:
ID|Name|Job Desc
----------------
1|xxxx|SO Job
2|YYYY|SO Careers Job
3|RRRRR|SO
4|ZZZZ ZZ|SO Job
5|AAAA|YU

      

The file is huge, so I cannot use StringEscapeUtils. Any suggestions on this?

2 answers


You can use a Scanner with a custom delimiter. The delimiter used here is meant to match \n, but not \u0001\n (where \u0001 represents CTRL+A):



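import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;
import java.util.regex.Pattern;

try {
    PrintWriter writer = new PrintWriter("dboutput.txt");
    Scanner sc = new Scanner(new File("dbinput.txt"));
    // Custom delimiter: meant to match \n but not \u0001\n (CTRL+A + newline)
    sc.useDelimiter(Pattern.compile("^(?!.*(\\u0001\\n)).*\\n$"));
    while (sc.hasNext()) {
        // Write each token (one record) on its own line
        writer.println(sc.next());
    }
    sc.close();
    writer.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}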
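
The stated intent (treat \n as a record separator unless it directly follows \u0001) can also be written as a negative lookbehind. The following is only a minimal sketch, assuming Unix line endings (\n, not \r\n) and that each escaped newline should become a single space; the class name EscapedNewlineScanner and the method readRecords are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class EscapedNewlineScanner {
    // A newline counts as a record separator only if it is NOT preceded by CTRL+A.
    static final Pattern RECORD_END = Pattern.compile("(?<!\\u0001)\\n");

    // Hypothetical helper: returns one string per logical record.
    static List<String> readRecords(Readable source) {
        List<String> records = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter(RECORD_END);
            while (sc.hasNext()) {
                // Escaped newlines are still inside the token; turn each into a space.
                records.add(sc.next().replace("\u0001\n", " "));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // \u0002 is CTRL+B (column separator), \u0001 is CTRL+A (escape).
        String sample = "1\u0002xxxx\u0002SO Job\n"
                      + "2\u0002YYYY\u0002SO Careers\u0001\nJob\n"
                      + "3\u0002RRRRR\u0002SO\n";
        for (String record : readRecords(new StringReader(sample))) {
            // Show CTRL+B as | so the result is readable on a console.
            System.out.println(record.replace('\u0002', '|'));
        }
    }
}
```

For a real file, the StringReader would be replaced by a reader over the input file. Collecting every record in a list is only for demonstration; with a huge file each record would be written out as soon as it is read.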

      


Tim's answer is partially correct, but it still does not handle newlines escaped with CTRL + A.

Here is my solution, which builds on Tim's answer:



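File f = new File("C:\\Users\\SV7104\\Desktop\\sampletest.txt");
// Records end with CTRL+B followed by a newline; surrounding whitespace is skipped
Scanner sc = new Scanner(f).useDelimiter(Pattern.compile("\\s*\\u0002\\n\\s*"));
while (sc.hasNext()) {
    System.out.print(1); // debug marker printed before each record
    // Replace every CTRL+A-escaped newline with a space
    System.out.println(sc.next().replaceAll("\\u0001\\n", " "));
}
sc.close();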

      

If there is a more efficient method, I would be interested to hear about it.
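
One alternative that avoids regular expressions is to keep using BufferedReader.readLine() and glue a line onto the next one whenever it ends in CTRL+A. This is a rough sketch rather than a tested solution for the real data: it assumes the escape is always the last character before the newline and that an escaped newline should become a single space. The class name EscapedLineJoiner and the method joinEscapedLines are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class EscapedLineJoiner {
    private static final char ESCAPE = '\u0001'; // CTRL+A

    // Copies records from in to out, joining physical lines that end in CTRL+A.
    static void joinEscapedLines(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty() && line.charAt(line.length() - 1) == ESCAPE) {
                // Escaped newline: drop CTRL+A, add a space, keep accumulating.
                record.append(line, 0, line.length() - 1).append(' ');
            } else {
                // Unescaped newline: the record is complete, so write it out.
                record.append(line);
                out.write(record.toString());
                out.write('\n');
                record.setLength(0);
            }
        }
        if (record.length() > 0) {
            // Input ended right after an escape; flush what was collected.
            out.write(record.toString());
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        String sample = "1\u0002xxxx\u0002SO Job\n"
                      + "2\u0002YYYY\u0002SO Careers\u0001\nJob\n"
                      + "3\u0002RRRRR\u0002SO\n";
        StringWriter result = new StringWriter();
        joinEscapedLines(new StringReader(sample), result);
        // Show CTRL+B as | so the result is readable on a console.
        System.out.print(result.toString().replace('\u0002', '|'));
    }
}
```

Only the current record is held in memory, so the size of the file should not matter. For real input, the Reader and Writer could come from Files.newBufferedReader and Files.newBufferedWriter with the file's actual charset.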
