JavaScript regex with variable input
I want to extract the following information from a long piece of text, which I copy and paste into a textbox; the result is then processed into a table:
- Name
- Address
- Status
Example snippet (names, addresses, etc. are randomized):
Thuisprikindeling voor: Vrijdag 15 Mei 2015 DE SMART BON 22 afspraken
Pagina 1/4
Persoonlijke mededeling:
Algemene mededeling:
Prikpostgegevens: REEK-Eeklo extern, (-)
Telefoonnummer Fax Mobiel 0499/9999999 Email
DUMMY FOO V Stationstreet 2 8000 New York F N - Sober BSN: 1655
THUIS Analyses: Werknr: PIN: 000000002038905
Opdrachtgever: Laboratorium Arts:
Mededeling: Some comments // VERY DIFFICULT
FO DUMMY FOO V Butterstreet 6 8740 Melbourne F N - Sober BSN: 15898
THUIS Analyses: Werknr: AFD 3 PIN: 000000002035900
Opdrachtgever: Laboratorium Arts:
Mededeling: ZH BLA / BLA BLA - AFD 3 - SOCIAL BEER
JOHN FOOO V Waterstreet 1 9990 Rome F N - Sober BSN: 17878
THUIS / Analyses: Werknr: K111 PIN: 000000002037888
Opdrachtgever: Laboratorium Arts:
Mededeling: TRYOUT/FOO
FO SMOOTH M.FOO M Queen Elisabethstreet 19 9990 Paris F NN - Not Sober BSN: 14877
This is what I want to get out of it:
DUMMY FOO Stationstreet 2 8000 New York Sober
FO DUMMY FOO Butterstreet 6 8740 Melbourne Sober
JOHN FOOO Waterstreet 1 9990 Rome Sober
FO SMOOTH M.FOO Queen Elisabethstreet 19 9990 Paris Not sober
My strategy at the moment is the following:
- Keep all lines that start with at least two words in capitals and that contain a 4-digit postcode.
- Discard all other lines, as I only need the lines with names and addresses.
- Extract all the information I need from each remaining line.
- Split it into name / address / status.
I am using the following code:
// Regular expressions
// Match lines that start with at least two UPPERCASE words, each followed by a space
var pattern = /^(([A-Z'.* ]{2,} ){2,}[A-Z]{1,})(?=.*BSN)/;
var postcode = /\d{4}/;
var searchSober = /(N - Sober)+/;
// Case-insensitive, because the input spells it "Not Sober"
var searchNotSober = /(NN - Not sober)+/i;
var adres = inputText.split('\n');
for (var i = 0; i < adres.length; i++) {
    // Only process a line that contains a postcode and starts with
    // at least two UPPERCASE words, each followed by a space
    var temp = adres[i];
    if (pattern.test(temp) && postcode.test(temp)) {
        // Remove the BSN so that its digits cannot be mistaken for the postal code
        temp = temp.replace(/BSN.*/g, "");
        // Example: DUMMY FOO V Stationstreet 2 8000 New York F N - Sober
        // Select the name: always the first element of the match array
        var name = temp.match(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/)[0];
        // Remove the name from the string
        temp = temp.replace(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/, "");
        // " V Stationstreet 2 8000 New York F N - Sober"
        // Extract the gender, using jQuery trim to strip whitespace
        // "V"
        var gender = $.trim(temp.match(/^( [A-Z'*.]{1} )/)[0]);
        // Remove the gender
        temp = temp.replace(/^( [A-Z'*.]{1} )/, "");
        // "Stationstreet 2 8000 New York F N - Sober"
        // Look for the status
        var status = "unknown";
        if (searchNotSober.test(temp)) {
            status = "Not sober";
        } else if (searchSober.test(temp)) {
            status = "Sober";
        } else {
            status = "unknown";
        }
        // Select the address with /^.*[0-9]{4}.[\w-]{2,40}/
        // "Stationstreet 2 8000 New York"
        var address = $.trim(temp.match(/^.*[0-9]{4}.[\w-]{2,40}/gm));
        // Assemble into a person object
        var person = {name: name + "", address: address + "", gender: gender + "", status: status + "", location: [], marker: []};
    }
}
Now I have several problems:
- Sometimes names are not recorded in CAPITALS.
- Sometimes the zip code is missing, so my code just stops working.
- Sometimes they put a * in front of the name.
The broader question is: what strategy can you take to deal with this kind of dirty input? Do I have to add a special case for every error I see in the snippets I receive? I feel I never know exactly what this piece of code will pick out when I run it on a different input.
Here's a general way to handle it:
Find all the lines that are most likely matches. Match on a loose marker (for example "BSN") or on anything else that keeps you from missing a match, even if it gives you false positives.
Filter out the false positives. This is the part you will need to update and tweak as you go. Make sure you only filter out what is completely irrelevant.
Filter the remaining input strictly. Whatever does not match is logged or submitted for manual review; whatever does match now conforms to a known strict pattern.
Normalizing and extracting the data is now much easier, because the possible input at this stage is restricted.
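The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a finished parser: the function names (`isCandidate`, `isFalsePositive`, `parseSchedule`), the false-positive list, and the strict pattern are all made up for this example and would need tuning against real input. The strict pattern deliberately tolerates a leading `*`, lowercase names, and a missing postcode, since those were the cases mentioned in the question.

```javascript
// Illustrative sketch; every name and pattern here is an assumption.

// Step 1: loose selection. Over-match on purpose so nothing is missed.
function isCandidate(line) {
    return /BSN/i.test(line);
}

// Step 2: known false positives. Extend this list as new ones show up.
var falsePositives = [/^Pagina\b/i, /^Mededeling:/i];
function isFalsePositive(line) {
    return falsePositives.some(function (re) { return re.test(line); });
}

// Step 3: strict pattern.
// optional "*", name, gender (M/V), address, one letter, N or NN, status, "BSN:"
var strict = /^\*?\s*(.+?)\s+([MV])\s+(.+?)\s+[A-Z]\s+N{1,2}\s+-\s+(Not Sober|Sober)\s+BSN:/i;

function parseSchedule(text) {
    var people = [];
    var rejected = []; // lines to log or review by hand
    text.split('\n').forEach(function (rawLine) {
        var line = rawLine.trim();
        if (!isCandidate(line) || isFalsePositive(line)) {
            return; // irrelevant line
        }
        var m = strict.exec(line);
        if (!m) {
            rejected.push(line); // looked relevant, but did not fit the pattern
            return;
        }
        // Step 4: normalize the extracted fields
        people.push({
            name: m[1].toUpperCase(),
            gender: m[2].toUpperCase(),
            address: m[3],
            status: /^not/i.test(m[4]) ? "Not sober" : "Sober"
        });
    });
    return { people: people, rejected: rejected };
}
```

Calling `parseSchedule(inputText)` returns the parsed persons in `people` and every line that looked relevant but failed the strict pattern in `rejected`, so new kinds of dirty input show up in a list you can inspect instead of silently breaking the run.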