Find runs of caps

I have a file containing some fully capitalized words and some mixed-case words, and I want to extract the runs of fully capitalized words (each run contained in one line), i.e. things separated by \b and containing at least two uppercase letters and no lowercase letters. All 7-bit.

So, for example, if the line is

The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.

      

then I want to extract QUICK and OV3R T4E LAZY.

This is what I have so far:

while (<$fh>) { # file handle
    my @array = $_ =~ /\b[^a-z]*[A-Z][^a-z]*[A-Z][^a-z]*\b/;
    push @bigarray, @array;
}

      

Is there a more elegant way to do this than [^a-z]*[A-Z][^a-z]*[A-Z][^a-z]*?

+3




3 answers


You seem to want every character of the word class (the \w construct) except lowercase letters to be allowed.
To require at least two capitals and no lowercase letters, you probably cannot avoid spelling out that the capitals may optionally be surrounded by more capitals, digits, or underscores.

Perhaps this matches what you need.

\b[\d_]*[A-Z]+[\d_]*[A-Z]+[\d_]*\b

Expanded:

 \b 
 [\d_]* 
 [A-Z]+ 
 [\d_]* 
 [A-Z]+ 
 [\d_]* 
 \b 

      

Results

Input:

The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.  

      



Output:

 **  Grp 0 -  ( pos 4 , len 5 ) 
QUICK  
-----
 **  Grp 0 -  ( pos 26 , len 4 ) 
OV3R  
-----
 **  Grp 0 -  ( pos 31 , len 3 ) 
T4E  
-----
 **  Grp 0 -  ( pos 35 , len 4 ) 
LAZY  
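The per-word pattern above can be tried from Perl roughly as follows. This is a minimal sketch, not the answerer's own test harness: the helper name find_cap_words is hypothetical, and the position is read from the built-in @- array.

```perl
use strict;
use warnings;

# Per-word pattern from the answer: at least two capitals, optionally
# mixed with digits or underscores, and no lowercase letters.
my $two_caps = qr/\b[\d_]*[A-Z]+[\d_]*[A-Z]+[\d_]*\b/;

# Hypothetical helper: return [position, text] for every match in a line.
sub find_cap_words {
    my ($line) = @_;
    my @hits;
    while ($line =~ /($two_caps)/g) {
        push @hits, [ $-[1], $1 ];    # $-[1] is the start offset of group 1
    }
    return @hits;
}

for my $hit (find_cap_words('The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.')) {
    printf "pos %d, len %d: %s\n", $hit->[0], length $hit->[1], $hit->[1];
}
```

Each word is reported separately here; joining adjacent words into one run needs the extended pattern from the update.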

      


Update: if you want to match consecutive chunks separated by whitespace as a single run,
this will work.

 # (?&two_caps)(?:\s+(?&two_caps))*(?(DEFINE)(?<two_caps>\b[\d_]*[A-Z]+[\d_]*[A-Z]+[\d_]*\b))

 (?&two_caps) 
 (?:
      \s+ (?&two_caps) 
 )*

 (?(DEFINE)
      (?<two_caps>
           \b 
           [\d_]* 
           [A-Z]+ 
           [\d_]* 
           [A-Z]+ 
           [\d_]* 
           \b 
      )
 )

      

Output:

 **  Grp 0 -  ( pos 4 , len 5 ) 
QUICK  
 **  Grp 1 -  NULL 
---------
 **  Grp 0 -  ( pos 26 , len 13 ) 
OV3R T4E LAZY  
 **  Grp 1 -  NULL 
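For reference, a sketch of how the DEFINE version might be used in a Perl script (this assumes Perl 5.10 or later, which introduced (?(DEFINE)...) and (?&name); the helper name find_cap_runs is hypothetical). The whole match is taken with substr and the @- / @+ offsets, because the named group itself stays undefined, as the "Grp 1 - NULL" lines above show.

```perl
use strict;
use warnings;

# Run pattern from the answer, written with /x for readability.
my $cap_run = qr/
    (?&two_caps) (?: \s+ (?&two_caps) )*
    (?(DEFINE)
        (?<two_caps> \b [\d_]* [A-Z]+ [\d_]* [A-Z]+ [\d_]* \b )
    )
/x;

# Hypothetical helper: return every run of all-caps words in a line.
sub find_cap_runs {
    my ($line) = @_;
    my @runs;
    while ($line =~ /$cap_run/g) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        push @runs, substr($line, $-[0], $+[0] - $-[0]);
    }
    return @runs;
}

print "$_\n" for find_cap_runs('The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.');
```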

      

+1




If you really need these matches as runs, perhaps use split with zero-width assertions, then filter the results:

while (<DATA>) {
    for my $e (split (/(?<=\b)([A-Z0-9_ ]+)(?=\b)/)){
        $e =~ s/^\s+|\s+$//g;
        print "$e\n" unless ($e =~/^$/ or $e =~ /.*[a-z]/);
    }
}

__DATA__
The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.

      

Printing

QUICK
OV3R T4E LAZY

      

So how does it work? split will separate the parts that match your criteria from those that don't:



use Data::Dumper;

while (<DATA>) {
    print Dumper split (/(?<=\b)([A-Z0-9_ ]+)(?=\b)/); 
}

      

Prints:

$VAR1 = 'The';
$VAR2 = ' QUICK ';
$VAR3 = 'Brown';
$VAR4 = ' ';
$VAR5 = 'fox';
$VAR6 = ' ';
$VAR7 = 'JUMPs';
$VAR8 = ' OV3R T4E LAZY ';
$VAR9 = 'DoG.';

      

The for loop then iterates over that list, trims whitespace from each element, and skips any element that is empty or contains a lowercase character.

This can be condensed into a single expression that builds your array for each line:

grep { $_ =~ /^(?=.*[A-Z].*[A-Z])[^a-z]+$/ } map {s/^\s+|\s+$//g; $_} split (/(?<=\b)([A-Z0-9_ ]+)(?=\b)/);
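As a sketch of how that pipeline could be wrapped for reuse: the sub name cap_runs_via_split is hypothetical, the s///r form assumes Perl 5.14 or later, and the filter is written as two plain tests (at least two capitals anywhere, no lowercase) instead of a single lookahead.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into candidate runs, trim them,
# and keep only those with at least two capitals and no lowercase.
sub cap_runs_via_split {
    my ($line) = @_;
    return grep { /[A-Z].*[A-Z]/ && !/[a-z]/ }
           map  { s/^\s+|\s+$//gr }               # /r returns the trimmed copy
           split /(?<=\b)([A-Z0-9_ ]+)(?=\b)/, $line;
}

my @bigarray;
push @bigarray, cap_runs_via_split('The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.');
print "$_\n" for @bigarray;
```

Passing the line explicitly avoids relying on the implicit $_ that the one-liner uses.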

      

+1




\b(?=\S*[A-Z]\S*[A-Z])[A-Z0-9]{2,}\b

      

Try this. See the demo:

https://regex101.com/r/cK4iV0/24
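A minimal sketch of this pattern in Perl (the helper name caps_words is hypothetical). In list context with /g and no capture groups, the match returns every matched substring. Note that this pattern matches each word on its own rather than joining neighbours into a run, and it does not allow underscores.

```perl
use strict;
use warnings;

# Hypothetical helper: return every word of capitals and digits that has
# at least two capitals in its surrounding non-whitespace chunk.
sub caps_words {
    my ($line) = @_;
    return $line =~ /\b(?=\S*[A-Z]\S*[A-Z])[A-Z0-9]{2,}\b/g;
}

print join(', ', caps_words('The QUICK Brown fox JUMPs OV3R T4E LAZY DoG.')), "\n";
```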

0








