Regular expressions for parsing formatted numbers

I am parsing documents containing large amounts of formatted numbers, for example:

 Frc consts  --     1.4362                 1.4362                 5.4100
 IR Inten    --     0.0000                 0.0000                 0.0000
 Atom AN      X      Y      Z        X      Y      Z        X      Y      Z
    1   6     0.00   0.00   0.00     0.00   0.00   0.00     0.00   0.00   0.00
    2   1     0.40  -0.20   0.23    -0.30  -0.18   0.36     0.06   0.42   0.26

      

These are separate lines with significant leading whitespace (and there may or may not be significant trailing whitespace). The lines are 72, 72, 78, 78 and 78 characters long. I can deduce the boundaries between fields. They are described using Fortran format notation (nx = n spaces, an = n alphanumeric characters, in = integer in n columns, fm.n = float of m characters with n places after the decimal point):

 (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
 (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
 (1x,a4,a4,3(2x,3a7))
 (1x,2i4,3(2x,3f7.2))
 (1x,2i4,3(2x,3f7.2))

      

I have potentially several thousand different formats (which I can autogenerate or farm), and I describe them using regular expressions built from component regexes. For example, if regf10_4 is a regex for any string that satisfies the f10.4 constraint, I can create a regex of the form:

COMMENTS 
      (\s
      .{14}
      \s
      regf10_4,
      \s{13}
      regf10_4,
      \s{13}
      regf10_4,
)

      

I would like to know whether there are regular expressions that can be reused in this way. There is a wide variety of ways in which computers and humans write numbers that are compatible with, say, f10.4. I believe the following are all legitimate Fortran input and/or output (I don't require suffixes of the form f or d, as in 12.4f) [the formatting in SO should be read as no leading space for the first, one for the second, etc.]

-1234.5678
 1234.5678
            // missing number
 12345678.
 1.
 1.0000000
    1.0000
        1.
 0.
        0.
     .1234
    -.1234
    1E2
    1.E2
    1.E02
  -1.0E-02
**********  // number over/underflow

      

They must be robust to the contents of adjacent fields, i.e. they should check only the exact 10 characters at the exact position. So the following are all valid for (a1,f5.2,a1):

a-1.23b   // -1.23
- 1.23.   // 1.23
3 1.23-   // 1.23

      

I am using Java, so Java 1.6 compatible regex constructs are needed (e.g. no Perl extensions).

3 answers


As I understand it, each line contains one or more fixed-width fields that can contain labels, spaces, or different types of data. If you know the widths and types of the fields, extracting the data is just a matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes cannot make this job easier; in fact, all they can do is make it much more difficult.
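The substring/trim/parse approach described above can be sketched as follows. This is only an illustration under stated assumptions: the class name FixedWidthFields and the helper parseField are hypothetical, the column offsets are derived from the question's (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4) layout, and treating blank or all-asterisk fields as "no value" is an assumption about how they should be handled. The sample line is rebuilt with String.format so that the column positions are exact.

```java
import java.util.Locale;

class FixedWidthFields {
    // Hypothetical helper: parses the numeric field in columns [start, end).
    // Returns null for a blank field or an overflow field made of asterisks
    // (assumed handling, not specified in the question).
    static Double parseField(String line, int start, int end) {
        String raw = line.substring(start, end).trim();
        if (raw.length() == 0 || raw.matches("\\*+")) {
            return null;
        }
        // Double.valueOf accepts forms such as "1.", ".1234", "-.1234", "1.E02".
        return Double.valueOf(raw);
    }

    public static void main(String[] args) {
        // Rebuild the 72-character "Frc consts" sample line from the question.
        String line = String.format(Locale.US, " %-14s %10.4f%13s%10.4f%13s%10.4f",
                "Frc consts  --", 1.4362, "", 1.4362, "", 5.4100);

        String label = line.substring(1, 15).trim();   // a14
        Double first = parseField(line, 16, 26);       // f10.4
        Double second = parseField(line, 39, 49);      // f10.4 after 13x
        Double third = parseField(line, 62, 72);       // f10.4 after 13x

        System.out.println(label + " -> " + first + ", " + second + ", " + third);
    }
}
```

Because each field is cut out by position before parsing, the contents of neighbouring columns cannot affect the result. A Fortran D exponent (as in 1.0D-02) would not be accepted by Double.valueOf and would need separate handling.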



Scanner doesn't really help either. True, it has predefined regexes for different value types, and it does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by a delimiter that it can recognize. Fixed-width data, by definition, does not require delimiters. You could fake the delimiters by using a lookahead to assert that a certain number of characters remain in the string, but that is just another way of making the job harder than it needs to be.

It also looks like performance will be a major issue; even if you can make the regex approach work, it will probably be too slow. Not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit this problem. I suggest you forget about regular expressions for this job.



You can start with this and go from there.

This regex matches all of your numbers.
Unfortunately, it also matches the 3 in "3 1.23-".



// [-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?
// 
// Match a single character present in the list "-+" «[-+]?»
//    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the regular expression below «(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)»
//    Match either the regular expression below (attempting the next alternative only if this one fails) «[0-9]+(?:\.[0-9]*)?»
//       Match a single character in the range between "0" and "9" «[0-9]+»
//          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//       Match the regular expression below «(?:\.[0-9]*)?»
//          Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//          Match the character "." literally «\.»
//          Match a single character in the range between "0" and "9" «[0-9]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Or match regular expression number 2 below (the entire group fails if this one fails to match) «\.[0-9]+»
//       Match the character "." literally «\.»
//       Match a single character in the range between "0" and "9" «[0-9]+»
//          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:[eE][-+]?[0-9]+)?»
//    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//    Match a single character present in the list "eE" «[eE]»
//    Match a single character present in the list "-+" «[-+]?»
//       Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//    Match a single character in the range between "0" and "9" «[0-9]+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
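import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "document" is assumed to be a String holding the text to search.
Pattern regex = Pattern.compile("[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");
Matcher matcher = regex.matcher(document);
while (matcher.find()) {
    // matched text: matcher.group()
    // match start: matcher.start()
    // match end: matcher.end()
}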
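One possible way to stop the pattern from picking up characters in neighbouring fields (such as the stray 3 mentioned above) is to confine the matcher to the field's columns with Matcher.region() and require the whole region to match. The sketch below is not part of the original answer: the class name FieldRegex and the helper numberIn are hypothetical, and allowing blank padding on both sides of the number is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldRegex {
    // The number pattern from above, with optional blank padding on either side.
    static final Pattern NUMBER_FIELD = Pattern.compile(
            " *([-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?) *");

    // Hypothetical helper: returns the number text found in columns [start, end),
    // or null if those columns do not hold exactly one padded number.
    static String numberIn(String line, int start, int end) {
        Matcher m = NUMBER_FIELD.matcher(line);
        m.region(start, end);                 // look only at this field
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The (a1,f5.2,a1) examples from the question: the f5.2 field is columns 1..5.
        System.out.println(numberIn("a-1.23b", 1, 6)); // prints -1.23
        System.out.println(numberIn("- 1.23.", 1, 6)); // prints 1.23
        System.out.println(numberIn("3 1.23-", 1, 6)); // prints 1.23
    }
}
```

With this approach a blank field or a field of asterisks simply fails to match, so the caller can decide separately how to report missing or overflowed values.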

      



This is only a partial answer, but I was alerted to the Scanner class in Java 1.5, which can scan text and interpret numbers; its documentation gives a BNF for the numbers it can scan and interpret. I guess that BNF could be used to build a regex.
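As a rough illustration of what Scanner does, the sketch below lets it interpret a single field that has already been cut out by position, since Scanner itself splits on delimiters rather than on fixed columns. The class name ScannerField and the helper scanField are hypothetical, and Locale.US is set explicitly on the assumption that the data uses "." as the decimal separator.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerField {
    // Hypothetical helper: asks Scanner to interpret the text in columns [start, end).
    // Returns null when Scanner does not recognise the field as a double.
    static Double scanField(String line, int start, int end) {
        Scanner sc = new Scanner(line.substring(start, end)).useLocale(Locale.US);
        try {
            return sc.hasNextDouble() ? Double.valueOf(sc.nextDouble()) : null;
        } finally {
            sc.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(scanField("a-1.23b", 1, 6)); // prints -1.23
        System.out.println(scanField("3 1.23-", 1, 6)); // prints 1.23
    }
}
```

Whether every Fortran-style form listed in the question is accepted depends on the number grammar in the Scanner documentation, so each form would need to be checked against it.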
