Select a string without substring
I'm trying to select only names from text like this (Slovak Wikipedia dump):
|Meno = Hans Joachim
|Plné meno = Aristoteles (???????????)
|Plné meno = Francis Bacon
|Plné meno = Sokrates ({{Cudzojazyčne|grc|????????|pc=n}})
|Meno = Svätý František z Assisi <br /> ''(Giovanni Battista Bernardone)''
|Meno = Friedrich Ludwig Gottlob Frege
|Meno = Adam František Kollár (Kolárik)
|meno = [[J. Edgar Hoover|John Edgar Hoover]]
|meno = [[Benedikt XIV. (1740 – 1758)|Benedikt XIV.]]
|meno = [[Milan Rastislav Štefánik|Milan Rastislav Štefánik]]
|Meno = '''Ján Filc'''
|Meno = Jean le Rond d'Alembert
The output should look like this:
Hans Joachim
Aristoteles
Francis Bacon
Sokrates
Svätý František z Assisi
Friedrich Ludwig Gottlob Frege
Adam František Kollár (Kolárik)
J. Edgar Hoover|John Edgar Hoover
Benedikt XIV. (1740 – 1758)|Benedikt XIV.
Milan Rastislav Štefánik|Milan Rastislav Štefánik
Ján Filc
Jean le Rond d'Alembert
When the line contains nothing but the name, this regex works fine: = *(.*?)$
But when the value also contains something like "(???????????)", HTML tags, or anything between "{{" and "}}", I can't capture the name without the unwanted substring.
I tried many variations on regex101 (http://regex101.com/r/gS8iQ9/1), but none of them worked.
In my Java code, I am using:
Pattern pattern = Pattern.compile("= *(.*?)$");
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
    String foundSubstring = matcher.group(1);
    ...
Thanks for any help or suggestions on how to select text after "=" but without question marks, HTML code, etc.
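For illustration, one possible direction is to capture everything after the first "=" and then strip the unwanted pieces in separate passes, rather than trying to do it all in a single pattern. The sketch below is only that, a sketch: the class name NameExtractor and the method extractName are hypothetical, and it assumes that the question marks are literal "?" characters, that "{{...}}" templates are not nested, and that everything from the first HTML tag to the end of the line can be dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and the cleanup rules are assumptions
// derived from the sample lines above, not a general wikitext parser.
public class NameExtractor {
    // Captures everything after the first '=' on the line.
    private static final Pattern VALUE = Pattern.compile("=\\s*(.*)$");

    static String extractName(String line) {
        Matcher matcher = VALUE.matcher(line);
        if (!matcher.find()) {
            return null; // no "key = value" pair on this line
        }
        String name = matcher.group(1);
        // 1. Drop non-nested {{...}} templates.
        name = name.replaceAll("\\{\\{.*?\\}\\}", "");
        // 2. Drop everything from the first HTML tag to the end of the line.
        name = name.replaceAll("<[^>]*>.*$", "");
        // 3. Drop parentheses that are empty or hold only question marks.
        name = name.replaceAll("\\(\\s*\\?*\\s*\\)", "");
        // 4. Drop wiki link brackets and bold/italic quote runs ('' or ''').
        name = name.replaceAll("\\[\\[|\\]\\]|'{2,}", "");
        return name.trim();
    }

    public static void main(String[] args) {
        // Treats each command-line argument as one input line.
        for (String line : args) {
            System.out.println(extractName(line));
        }
    }
}
```

Because step 3 only removes parentheses that are empty or filled with question marks, a value such as "Adam František Kollár (Kolárik)" keeps its parenthesised part, and a single apostrophe as in "d'Alembert" survives step 4, which only matches runs of two or more.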
2 answers