Regex: how do I find a pattern?
I need to split the text below using a regular expression. I have already found patterns for the dddd-dddd and dddd-ddd[x] codes. What about the title text? I need to extract the string value, for example: "British Journal of Applied Science & Technology". How do I write a regular expression for that?
337 British Journal of Applied Science & Technology 2231-0843 5
338 British Journal of Economics, Management & Trade 2278-098X 5
339 British Journal of Education, Society & Behavioural Science 2278-0998 6
340 British Journal of Environment and Climate Change 2231-4784 5
341 British Journal of Mathematics & Computer Science 2231-0851 4
342 British Journal of Medicine and Medical Research 2231-0614 8
343 British Journal of Pharmaceutical Research 2231-2919 4
344 British Microbiology Research Journal 2231-0886 9
345 Bromatologia i Chemia Toksykologiczna 0365-9445 5
346 Budownictwo Górnicze i Tunelowe 1234-5342 5
347 Budownictwo i Architektura 1899-0665 3
348 Budownictwo, Technologie, Architektura 1644-745X 3
349 Builder 1896-0642 2
350 Built Environment 0263-7960 10
351 Bulgarian Journal of Veterinary Medicine 1311-1477 8
352 Bulgarian Medicine 1314-3387 2
353 Bulletin de la Société des sciences et des lettres de Łódź, Série: Recherches sur les déformations 0459-6854 7
354 Bulletin of Alfred Nobel University. Series "Legal Science" 2226-2873 6
355 Bulletin of Geography. Socio-economic Series 1732-4254 10
356 Bulletin of Geography: Physical Geography Series 2080-7686 9
357 Bulletin of the Polish Academy of Sciences. Mathematics 0239-7269 9
358 Business and Economic Horizons 1804-1205 8
359 Business and Economics Research Journal 1309-2448 10
360 Business Process Management Journal 1463-7154 10
(?<=\d\s)\D+(?=\s\d)
This should find what you need. If you're wondering how it works: the first part of the regex, (?<=\d\s), is a lookbehind asserting that the match must start right after a digit (\d) followed by a whitespace character (\s).
The second part, \D+, is what is actually matched: one or more non-digit characters.
The third part, (?=\s\d), is a lookahead ensuring that the match is followed by a whitespace character and a digit.
You can do this with an expression that uses lookahead and lookbehind, for example:
(?<=\d{3}\s).*(?=\s\d{4}-)
This expression requires three digits followed by a whitespace character before the text, and a whitespace character, four digits, and a dash after the text. The name itself is matched by the plain pattern .* between the two assertions.
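A rough Python sketch of this expression applied line by line follows. The helper name extract_title and the second input line are hypothetical, added only to show that the fixed three-digit lookbehind does not match a shorter index.

```python
import re

# Pattern from the answer above: text between a three-digit index
# and the first four digits of the ISSN.
TITLE = re.compile(r"(?<=\d{3}\s).*(?=\s\d{4}-)")

def extract_title(line):
    """Return the journal title from one line, or None if the pattern does not match."""
    match = TITLE.search(line)
    return match.group(0) if match else None

print(extract_title("338 British Journal of Economics, Management & Trade 2278-098X 5"))
# Hypothetical line with a two-digit index: the lookbehind finds no three digits here.
print(extract_title("42 Example Journal 1234-5678 3"))
```

If the index can have fewer than three digits, the lookbehind would need to be adapted.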
Since you did not specify a target language, here is how you could do it with Perl:
perl -pe 's/^\d+\s+//; s/\s+\d{4}-\d{3}[\dX]\s+\d+$//' test.txt
The second expression may need to be adapted depending on what the rest of your data looks like.
Prints:
British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
[snip]
Bulletin of the Polish Academy of Sciences. Mathematics
Business and Economic Horizons
Business and Economics Research Journal
Business Process Management Journal
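The same two-step idea (strip the leading index, then strip the trailing ISSN and number) could look roughly like this in Python. The function name strip_ends is hypothetical, and the trailing pattern is an assumption based on the sample data: an ISSN of the form dddd-dddd or dddd-dddX followed by a number.

```python
import re

def strip_ends(line):
    """Remove the leading index and the trailing ISSN plus final number from one line."""
    # Step 1: drop the leading index and the whitespace after it.
    line = re.sub(r"^\d+\s+", "", line)
    # Step 2: drop the trailing ISSN and the final number.
    return re.sub(r"\s+\d{4}-\d{3}[\dX]\s+\d+\s*$", "", line)

print(strip_ends('354 Bulletin of Alfred Nobel University. Series "Legal Science" 2226-2873 6'))
```

Anchoring the second substitution on the full ISSN shape keeps punctuation at the end of a title, such as a closing quote, intact.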
\d+ (.+) ....-.... \d+
The capturing group (.+) holds the title, extracting:
British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
British Journal of Mathematics & Computer Science
British Journal of Medicine and Medical Research
British Journal of Pharmaceutical Research
[... cut ...]
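One way to use this pattern in Python is sketched below, with named groups added so that every column can be read from a single match. The group names (index, title, issn, last) are illustrative assumptions and are not part of the answer's pattern.

```python
import re

# The answer's pattern with named groups added around each column.
ROW = re.compile(r"(?P<index>\d+) (?P<title>.+) (?P<issn>....-....) (?P<last>\d+)")

match = ROW.fullmatch("350 Built Environment 0263-7960 10")
if match:
    print(match.group("title"))
    print(match.group("issn"))
```

Using fullmatch forces the pattern to account for the whole line, so the greedy .+ cannot swallow part of the ISSN.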
I understand that you are looking for a regex, but if you want something a little more direct, it looks like your document can be parsed easily with simple string manipulation. I offer this idea as an alternative for people who don't want to use a regex.
import java.util.StringTokenizer;

String tmp = "340 British Journal of Environment and Climate Change 2231-4784 5";
String ending = tmp.substring(tmp.length() - 11); // "2231-4784 5": the ISSN and the final number
tmp = tmp.substring(0, tmp.length() - 11); // cut off the ending
StringTokenizer st = new StringTokenizer(tmp, " ");
String index = st.nextToken(); // reads the leading number up to the first space
tmp = tmp.substring(index.length()).trim(); // cut off the index and the surrounding spaces
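Now tmp holds the journal name, index holds the leading number, and the trailing part is saved in ending. This method only works if every line is formatted exactly as above: the fixed length of 11 assumes a single-digit final column, so a line ending in a two-digit number such as 10 would need a different offset.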
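For comparison, here is a regex-free sketch in Python that splits on whitespace from both ends instead of relying on a fixed offset. The function name split_row and the field names are hypothetical. It assumes every row has an index, a title, an ISSN, and a final number separated by whitespace.

```python
def split_row(line):
    """Split one row into (index, title, issn, last) using only string methods."""
    # The first whitespace-separated token is the index.
    index, rest = line.split(maxsplit=1)
    # The last two tokens are the ISSN and the final number; the rest is the title.
    title, issn, last = rest.rsplit(maxsplit=2)
    return index, title, issn, last

print(split_row("350 Built Environment 0263-7960 10"))
```

Because the split counts tokens from the right, the width of the final column does not matter.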