Select a string without substring

I'm trying to select only names from text like this (Slovak Wikipedia dump):

    |Meno = Hans Joachim
|Plné meno = Aristoteles (???????????)
|Plné meno = Francis Bacon
|Plné meno = Sokrates ({{Cudzojazyčne|grc|????????|pc=n}})
|Meno            = Svätý František z Assisi <br /> ''(Giovanni Battista Bernardone)''
  |Meno = Friedrich Ludwig Gottlob Frege
   |Meno             = Adam František Kollár (Kolárik)
|meno    = [[J. Edgar Hoover|John Edgar Hoover]]
|meno    = [[Benedikt XIV. (1740 – 1758)|Benedikt XIV.]]
|meno    = [[Milan Rastislav Štefánik|Milan Rastislav Štefánik]]
   |Meno             = '''Ján Filc'''
  |Meno = Jean le Rond d'Alembert

      

The output should look like this:

Hans Joachim
Aristoteles
Francis Bacon
Sokrates
Svätý František z Assisi
Friedrich Ludwig Gottlob Frege
Adam František Kollár (Kolárik)
J. Edgar Hoover|John Edgar Hoover
Benedikt XIV. (1740 – 1758)|Benedikt XIV.
Milan Rastislav Štefánik|Milan Rastislav Štefánik
Ján Filc
Jean le Rond d'Alembert

      

When the line contains nothing but the name, this regex works fine: = *(.*?)$

But when the line also contains something like "(???????????)", HTML tags, or anything between "{{" and "}}", I can't pick out the name without the unwanted substring.

I tried many variations on regex101 (http://regex101.com/r/gS8iQ9/1), but none of them worked.

In Java code, I am using

Pattern pattern = Pattern.compile("= *(.*?)$");
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
   String foundSubstring = matcher.group(1);
   ...
}

      

Thanks for any help or suggestions on how to select text after "=" but without question marks, HTML code, etc.



2 answers


Your regex was almost correct, but your input is tricky to work with. You can do it in one line:

String name = line.replaceAll(".*?=[\\[ ']*([\\p{L}0-9|'. ()–]+[\\p{L}.)]).*", "$1");

      




I tested this and it produced your desired output given your sample input.
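
To try the one-liner against lines from the question, a small harness along these lines might help. It is only a sketch: the class name NameExtractor and the method extractName are made up for illustration, and non-ASCII characters are written as Unicode escapes so the source compiles regardless of file encoding (\u2013 is the en dash from the original character class).

```java
import java.util.List;

// Hypothetical helper class; the name is not from the original answer.
final class NameExtractor {
    // Same pattern as the one-liner above; \u2013 is the en dash.
    private static final String NAME_REGEX =
            ".*?=[\\[ ']*([\\p{L}0-9|'. ()\u2013]+[\\p{L}.)]).*";

    // Replaces the whole line with capture group 1 (the name).
    // A line that does not match (for example, one with no '=')
    // is returned unchanged.
    static String extractName(String line) {
        return line.replaceAll(NAME_REGEX, "$1");
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "    |Meno = Hans Joachim",
                "|Pln\u00e9 meno = Aristoteles (???????????)",
                "|meno    = [[J. Edgar Hoover|John Edgar Hoover]]",
                "  |Meno = Jean le Rond d'Alembert");
        for (String line : lines) {
            System.out.println(extractName(line));
        }
    }
}
```

Note that replaceAll leaves a non-matching line untouched, so if the input can contain lines without "=", it is probably worth checking for a match first rather than treating the returned string as a name.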



Try the following:

Pattern pattern = Pattern.compile("=[\\s\\p{Punct}]*(.*?)\\p{Punct}*$");

      



\p{Punct} means punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
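
For reference, here is a minimal sketch of how the pattern above could be wired into the Matcher code from the question. The class name PunctTrimmer, the method extract, the Optional return type, and the trim() call are my own additions for illustration, not part of the answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; the name is not from the original answer.
final class PunctTrimmer {
    // Pattern from the answer above: skip whitespace and punctuation after '=',
    // capture lazily, then allow trailing punctuation up to the end of the line.
    private static final Pattern PATTERN =
            Pattern.compile("=[\\s\\p{Punct}]*(.*?)\\p{Punct}*$");

    // Returns the captured text, or Optional.empty() if the line has no match.
    // trim() is an addition here: the capture can end with a space when the
    // stripped punctuation was preceded by one.
    static Optional<String> extract(String line) {
        Matcher matcher = PATTERN.matcher(line);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(matcher.group(1).trim());
    }

    public static void main(String[] args) {
        String[] lines = {
            "  |Meno = Friedrich Ludwig Gottlob Frege",
            "|meno    = [[J. Edgar Hoover|John Edgar Hoover]]",
            "   |Meno             = '''J\u00e1n Filc'''"
        };
        for (String line : lines) {
            System.out.println(extract(line).orElse("(no match)"));
        }
    }
}
```

As far as I can tell, this only strips punctuation at the two ends of the value. Markup in the middle of the line, such as the "<br />" in the František z Assisi example, stays in the captured text, so those lines would still need extra handling.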
