How do you retrieve only the uppercase portions of a string in Stata?

Question

How do you retrieve only the uppercase portions of a string in Stata?

Here's some sample data:

part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"

Each line is entirely uppercase, contains lowercase throughout, or mixes the two. I am trying to use regular expressions to fetch only the uppercase portions of a string, but with no luck. The best I could do is detect when a string starts and ends with a certain number of uppercase characters:

generate title = regexs(0) if regexm(part1, "^[A-Z][A-Z][A-Z].*[A-Z][A-Z][A-Z]$")

I also tried the following, which I pulled from another question on the forum:

generate title = regexs(0) if(regexm(part1, "\b[A-Z]{2,}\b"))

It is supposed to look for words with at least two uppercase letters in the string, but it returns only missing values for me. I am using Stata version 13.1 for Mac.

+3

string regex uppercase stata

Danny Walker 21 jul. 15 at 12:20


3 answers

The implication of the question is that you expect the regex specification to return all instances. Reasonable though that may be, it is not how regular expressions work in Stata: you need to loop over instances. The code below uses moss (ssc install moss), for which that is the main purpose. (The name's allusion to gathering moss is typically weak wordplay on the part of the second author of the program, if he is reading this.)

clear 
input str100 part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end 
compress 

moss part1, match("([A-Z]+)") regex 
egen wanted = concat(_match*), p(" ")
l wanted

     +--------------------------------------------------+
     |                                           wanted |
     |--------------------------------------------------|
  1. |                          C M TEST MODEL SEADROME |
  2. |                                L B MAYER HONORED |
  3. |                                     A TOWN MOVES |
  4. |                          U S SAVINGS BONDS RALLY |
  5. |                        N D NOSES OUT S M U BY TO |
     |--------------------------------------------------|
  6. |                               P P BURN SQUEALERS |
  7. |                                        O B I T N |
  8. | S S N Y DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                          R D D R |
 10. |                       P PA IT S HIGHER EDUCATION |
     |--------------------------------------------------|
 11. |                                      DECORATIONS |
 12. |                                        S H M F S |
 13. |                        F D R ASKS VICTORY EFFORT |
     +--------------------------------------------------+

I assumed you want spaces between the results; the output is hardly readable otherwise. Punctuation between uppercase letters is not included; if you want it, you need to change the regex accordingly.
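As a rough sketch of what changing the regex could look like, the variant below lets a match continue through periods, so abbreviations such as "L.B." stay together. It assumes the moss call above has already been run, so its default output variables (_count, _match*, _pos*) exist and must be dropped first; the variable name wanted2 is only illustrative.

```stata
* Assumes part1 is loaded and moss has been run once as above,
* so its default output variables already exist and must go first.
drop _count _match* _pos*

* A match must start with an uppercase letter and may then continue
* through uppercase letters and periods, e.g. "L.B." or "S.M.U."
moss part1, match("([A-Z][A-Z.]*)") regex

* Join the matches with single spaces and strip trailing blanks
* left behind by observations with fewer matches.
egen wanted2 = concat(_match*), p(" ")
replace wanted2 = trim(wanted2)
list wanted2
```

Initial capitals of mixed-case words are still picked up, and digits and commas are still dropped, so this only addresses the periods.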

0

Nick Cox 21 jul. at 14:01


I cannot think of a single rule that will cleanly parse this type of data with one command. Often the best strategy is to target the simple cases first and then move on to more complex ones until diminishing returns make further attempts unattractive.

It is important to watch for unintended matches when using regular expressions, especially if the number of observations is large. I use listsome (from SSC) for this type of work.

It looks like part1 often starts with a city name followed by a state name or abbreviation. Here's code that handles the simple cases and the city/state cases:

clear
input str60 part1
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPEN" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end

* take care of the easy cases where there are no lowercase letters
gen title = part1 if !regexm(part1,"[a-z]")

* this type of string work is easier if text is aligned to the left
leftalign   // (from SSC)

* target cases of City, State at the start of part1.
* with complex patterns, it is easy to miss unintended matches when
* lots of obs are involved, so use -listsome- (from SSC) to track changes
gen title0 = title
replace title = trim(regexs(3)) if regexm(part1,"^([A-Z][a-z ]*)+, ([A-Z][a-z]*\.?)+([^a-z]+$)")
listsome if title != title0

list part1 title

0

Robert Picard 21 jul. 15 at 15:49


Roberto Ferrer · Accepted Answer · 2015-07-21T13:01:32+0000

As @stribizhev points out, negation can be a way:

clear
set more off

input ///
str70 myvar
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end

gen title = trim(regexs(2)) if regexm(myvar, "([,.]*)([^a-z]*$)")

list title

Result

. list title

     +-----------------------------------------------+
     |                                         title |
     |-----------------------------------------------|
  1. |                           TEST MODEL SEADROME |
  2. |                            L.B. MAYER HONORED |
  3. |                                  A TOWN MOVES |
  4. |                      U.S. SAVINGS BONDS RALLY |
  5. |             N.D. NOSES OUT S.M.U. BY 27 TO 20 |
     |-----------------------------------------------|
  6. |                          BURN 2,300 SQUEALERS |
  7. |                                               |
  8. | N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                               |
 10. |                     PA. IT'S HIGHER EDUCATION |
     |-----------------------------------------------|
 11. |                               806 DECORATIONS |
 12. |                                               |
 13. |                    F.D.R. ASKS VICTORY EFFORT |
     +-----------------------------------------------+

I think that's close to what you want, but not perfect. It's hard to imagine a simple method to clean up strings if they don't have a regular structure. Compare, for example, the input / output of observations 6 and 10.

If you have a database of titles, then after the initial cleanup you can compare your results against it. See ssc describe strgroup, for example.
