Using Java to read files in HDFS and match multi-line blocks with regex

Question


I am working with a log analysis tool.

I am using the log aggregation feature with Hadoop. The resulting Hadoop log file is so large that some cloud API methods cannot read its full contents into memory.

I want to match multi-line blocks in files where the first line contains the string [MAP] and the last line contains [\MAP]. I think I can do this with a regex, but the commonly used BufferedReader could not meet my requirements.

My question is, is there any other way to step through the file line by line, checking for the ones that match my regex?

PS I do not want to split the file into several smaller files for processing, as I am worried that this will result in some matching information not being found, since I could split the file in the middle of the corresponding block.

Below is a snippet of the log file - I need the section between [MAP] and [\MAP]:

2015-04-16 20:30:09,240 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: dump TS struct
2015-04-16 20:30:09,240 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper:

    [MAP] Id = 4
      [Children]
        [TS] Id = 2
          [Children]
            [RS] Id = 3
              [Parent] Id = 2 null [\Parent]
            [\RS]
          [\Children]
          [Parent] Id = 4 null [\Parent]
        [\TS]
      [\Children]
    [\MAP]

2015-04-16 20:30:09,241 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: Initializing Self 4 MAP
2015-04-16 20:30:09,242 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing Self 2 TS
2015-04-16 20:30:09,242 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: Operator 2 TS initialized


sol Apr 28 15 at 7:41


2 answers

Instead of a BufferedReader, you can use the Java NIO package (java.nio), which is very fast compared to a BufferedReader.


Shaik mujahid ali Apr 28 15 at 10:23


J Richard Snape · Accepted Answer · 2015-04-28T10:57:45+0000

NB EDITED after clarification in comments

You might be able to find multi-line blocks using a regex. You can of course write a regex that will match them, for example .*\[MAP\]((?s).*)\[\\MAP\] - noting that in Java you also have to escape every \ character, and that (?s) lets the . character match newlines, i.e.

String mapBlockRegex = ".*\\[MAP\\]((?s).*)\\[\\\\MAP\\]";
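As a minimal sketch of how that pattern behaves when the text does fit in memory, the snippet below applies it to a small in-memory string. The class name MapBlockRegexDemo, the method extractBody and the sample text are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: applies the regex from the answer to an in-memory string.
class MapBlockRegexDemo {
    // Same pattern as above; (?s) makes '.' match newlines inside the capture group.
    static final Pattern MAP_BLOCK =
            Pattern.compile(".*\\[MAP\\]((?s).*)\\[\\\\MAP\\]");

    // Returns the text captured between [MAP] and [\MAP], or null if there is no match.
    static String extractBody(String text) {
        Matcher m = MAP_BLOCK.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample, loosely shaped like the log in the question.
        String sample = "INFO before\n[MAP] Id = 4\n  [TS] Id = 2\n[\\MAP]\nINFO after\n";
        System.out.println(extractBody(sample));
    }
}
```

Because the capture group is greedy, a text containing several blocks would be captured from the first [MAP] up to the last [\MAP] in a single match.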

However - as you pointed out - this creates difficulties if the file does not fit in memory, and splitting also has some difficulties.

I suggest another idea - scan the file line by line and use a state variable to record whether you are inside a block or not. The basic algorithm is as follows:

When you match the start of a block, set the state variable to true.
While the state is true, append each line to a StringBuilder.
When you match the end of a block, set the state variable to false and use the String you have built, e.g. output it to a file, to the console, or use it programmatically.

Java solution

Here is one way to implement the above using Scanner, which steps through the stream line by line, discarding lines as it goes and so avoiding an OutOfMemoryError. Please note that this code can throw exceptions - I have simply declared them as thrown, but you could handle them in a try..catch..finally block. Also note that Scanner swallows IOExceptions, although, as the docs say, if that is important to you:

The most recent IOException thrown by the underlying readable can be obtained with the ioException() method.

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;


public class LogScanner
{

    public static void main(String[] args) throws FileNotFoundException
    {
        String path = "D:\\hadoopTest.log";
        String blockStart = ".*\\[MAP\\].*";
        String blockEnd = ".*\\[\\\\MAP\\].*";
        boolean inBlock = false;
        StringBuilder block = null;

        // try-with-resources closes the Scanner, and with it the stream, on exit
        try (Scanner sc = new Scanner(new FileInputStream(path), "UTF-8")) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                if (line.matches(blockStart)) {
                    // Start of a block: begin collecting lines
                    inBlock = true;
                    block = new StringBuilder();
                }

                if (inBlock) {
                    block.append(line);
                    block.append("\n");
                }

                // Only treat an end marker as closing a block we actually opened
                if (inBlock && line.matches(blockEnd)) {
                    inBlock = false;
                    String completeBlock = block.toString();
                    System.out.println(completeBlock);
                    // I'm outputting the block to stdout, you could append to a file\whatever.
                }
            }
        }
    }
}
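On the point about Scanner swallowing IO exceptions, the following is a minimal sketch of checking ioException() once the loop has finished, so that a read failure is not mistaken for a normal end of input. The class name ScannerIoCheck and the method countLines are hypothetical; wrapping the failure in an UncheckedIOException is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical helper: reads every line, then surfaces any IOException that Scanner swallowed.
class ScannerIoCheck {
    static int countLines(Readable source) {
        int count = 0;
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNextLine()) {
                sc.nextLine();
                count++;
            }
            // Scanner treats a read failure as end of input, so ask it explicitly.
            IOException failure = sc.ioException();
            if (failure != null) {
                throw new UncheckedIOException(failure);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the real log stream here.
        System.out.println(countLines(new StringReader("line 1\nline 2\n")));
    }
}
```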

Caveat: there may be features of your file for which this will not work without changes. If you can have nested [MAP] blocks, then inBlock must become an int that you increment when you match a block start and decrement when you match a block end - append every line while inBlock > 0, and only output the full string when inBlock returns to zero.
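The nested case could be sketched roughly as follows. This is only an illustration under some assumptions: the class name NestedBlockScanner and the method extractBlocks are hypothetical, the markers are matched as literal substrings rather than with a regex, and each line is assumed to hold at most one marker.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch: collects outermost [MAP]...[\MAP] blocks, tracking nesting depth.
class NestedBlockScanner {
    static final String START = "[MAP]";
    static final String END = "[\\MAP]";

    static List<String> extractBlocks(Readable source) {
        List<String> blocks = new ArrayList<>();
        StringBuilder block = new StringBuilder();
        int depth = 0; // plays the role of the int-valued inBlock
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                if (line.contains(START)) {
                    depth++;
                }
                if (depth > 0) {
                    block.append(line).append('\n');
                }
                // Ignore stray end markers that do not close an open block.
                if (depth > 0 && line.contains(END)) {
                    depth--;
                    if (depth == 0) {
                        blocks.add(block.toString());
                        block.setLength(0);
                    }
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Made-up input with one nested block.
        String sample = "before\n[MAP] outer\n[MAP] inner\n[\\MAP]\n[\\MAP]\nafter\n";
        for (String b : extractBlocks(new StringReader(sample))) {
            System.out.print(b);
        }
    }
}
```

Only the current outermost block is held in memory at any time, so the approach still suits files that are too large to load whole.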

Splitting command line when searching for matches on the same line

If you were searching for a string and the matches were guaranteed to be on a single line, then splitting would be OK as long as the splits occur only at the end of complete lines.

In this case, you can use the command line to split the file. If you are on Linux (or any *nix, I think) you can use the split command, e.g.

split --lines=75000

This question and answer covers it in more detail.

There is no equivalent command on Windows that I know of, but you can install tools that work similarly, e.g. GNU CoreUtils for Windows or 7-Zip. Caveat: I've never used them for splitting.
