JavaScript regex with variable input

I want to extract the following information from a long piece of text, which I paste into a textbox and then process into a table:

  • Name
  • Address
  • Status

Example snippet (names, addresses, etc. are more or less randomized):

Thuisprikindeling voor: Vrijdag 15 Mei 2015 DE SMART BON 22 afspraken
Pagina 1/4
Persoonlijke mededeling:
Algemene mededeling:
Prikpostgegevens: REEK-Eeklo extern, (-)
Telefoonnummer Fax Mobiel 0499/9999999 Email dummy.dummy@gmail.com
DUMMY FOO V Stationstreet 2 8000 New York F N - Sober BSN: 1655
THUIS Analyses: Werknr: PIN: 000000002038905
Opdrachtgever: Laboratorium Arts:
Mededeling:  Some comments // VERY DIFFICULT
FO DUMMY FOO V Butterstreet 6 8740 Melbourne F N - Sober BSN: 15898
THUIS Analyses: Werknr: AFD 3 PIN: 000000002035900
Opdrachtgever: Laboratorium Arts:
Mededeling: ZH BLA / BLA BLA - AFD 3 - SOCIAL BEER
JOHN FOOO V Waterstreet 1 9990 Rome F N - Sober BSN: 17878
THUIS / Analyses: Werknr: K111 PIN: 000000002037888
Opdrachtgever: Laboratorium Arts:
Mededeling: TRYOUT/FOO
FO SMOOTH M.FOO M Queen Elisabethstreet 19 9990 Paris F NN - Not Sober BSN: 14877

      

The output I want from this is:

DUMMY FOO Stationstreet 2 8000 New York Sober
FO DUMMY FOO Butterstreet 6 8740 Melbourne Sober
JOHN FOOO Waterstreet 1 9990 Rome Sober
FO SMOOTH M.FOO Queen Elisabethstreet 19 9990 Paris Not sober

      

My strategy at the moment uses the following:

  • Select all lines that start with at least two words in capitals and contain a 4-digit postcode.
  • Discard all other lines, since I only need the lines with names and addresses.
  • Extract the information I need from each remaining line.
  • Split it into name / address / status.

I am using the following code:

  //Regular expressions

    //Match lines which start with at least two UPPERCASE words, each followed by a space
    var pattern = /^(([A-Z'.* ]{2,} ){2,}[A-Z]{1,})(?=.*BSN)/;
    var postcode = /\d{4}/;
    var searchSober = /(N - Sober)+/;
    //Case-insensitive, because the input spells it "Not Sober"
    var searchNotSober = /(NN - Not sober)+/i;

    var adres = inputText.split('\n');


    for (var i = 0; i < adres.length; i++) {

        // Only process a line that starts with at least two UPPERCASE
        // words followed by a space and that also contains a postcode
        var temp = adres[i];

        if (  pattern.test(temp) && postcode.test(temp)) {

            //Remove BSN in order to be able to use digits to sort out the postal code
            temp = temp.replace( /BSN.*/g, "");

            // Example: DUMMY FOO V Stationstreet 2 8000 New York F N - Sober

            //Selection of the name, always take first part of the array
            // DUMMY FOO
            var name = temp.match(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/)[0];

            //remove the name from the string
            temp = temp.replace(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/, "");
            // V Stationstreet 2 8000 New York F N - Sober

            //filter out gender
            //Using jquery trim for whitespace trimming
            // V
            var gender = $.trim(temp.match(/^( [A-Z'*.]{1} )/)[0]);

            //remove gender
            temp = temp.replace(/^( [A-Z'*.]{1} )/, "");

            // Stationstreet 2 8000 New York F N - Sober
            //looking for status

            var status = "unknown";
            if ( searchNotSober.test(temp) ) {
                status = "Not sober";
            }
            else if ( searchSober.test(temp) ) {
                status = "Sober";


            }
            else {
                status = "unknown";
            }

            //Selection of the address /^.*[0-9]{4}.[\w-]{2,40}/
            //Stationstreet 2 8000 New York
            var address = $.trim(temp.match(/^.*[0-9]{4}.[\w-]{2,40}/gm));

            //assemble into person object.
            var person={name: name + "", address: address + "", gender: gender +"", status:status + "", location:[] , marker:[]};
            result.push(person);
        }
    }

      

Now I have several problems:

  • Sometimes names are not recorded in CAPITALS
  • Sometimes the zip code is not added, so my code just stops working.
  • Sometimes they put * in front of the name

The broader question is: what strategy can I use to deal with this kind of dirty input? Do I have to add a special case for every error I see in the snippets I receive? I feel I never know exactly what this code will pick up each time I run it on a different input.



1 answer


Here's a general way to handle it:



  • Find all the lines that are most likely a match. Matching loosely, for example on "Sober" or some other marker, keeps you from missing a real match, even if it gives you false positives.

  • Filter out the false positives. This is the step you will need to update and tweak as you go. Make sure you only filter out what is really irrelevant.

  • Filter strictly: input that matches a known strong pattern is accepted, and input that does not match is logged / submitted for manual review.

  • Normalizing and extracting the data will now be much easier, because by this stage you have restricted what the input can look like.
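
The steps above could be sketched roughly as follows. This is only an illustration, not a tested solution for the real data: the names CANDIDATE, FALSE_POSITIVES, STRICT and parseLines are made up for this example, and the patterns are guesses based on the sample in the question (a one-letter M/V gender, an "F N" or "F NN" block before the status, and a trailing "BSN:"). The strict pattern deliberately does not require capitals or a postcode, and it tolerates a leading *.

```javascript
// Step 1: loose candidate match. Case-insensitive so mixed-case lines are not missed.
var CANDIDATE = /sober|BSN/i;

// Step 2: known false positives. Extend this list as new ones show up.
var FALSE_POSITIVES = [/^Mededeling:/i, /^Opdrachtgever:/i];

// Step 3: strict pattern (an assumption derived from the sample lines):
// optional "*", name, one-letter gender, address, "F N"/"F NN", status, "BSN:".
var STRICT = /^\*?\s*(.+?)\s+([MV])\s+(.+?)\s+F\s+N{1,2}\s+-\s+(Not\s+Sober|Sober)\s+BSN:/i;

function parseLines(inputText) {
    var people = [];   // lines that matched the strict pattern
    var review = [];   // candidate lines that need a manual look

    inputText.split('\n').forEach(function (rawLine) {
        var line = rawLine.trim();

        // Step 1: skip anything that is clearly not a candidate.
        if (!CANDIDATE.test(line)) { return; }

        // Step 2: drop known false positives.
        var isFalsePositive = FALSE_POSITIVES.some(function (re) {
            return re.test(line);
        });
        if (isFalsePositive) { return; }

        // Step 3: strict match, otherwise queue the line for manual review.
        var m = STRICT.exec(line);
        if (!m) {
            review.push(line);
            return;
        }

        // Step 4: extraction and normalization are simple once the shape is known.
        people.push({
            name: m[1],
            gender: m[2].toUpperCase(),
            address: m[3],
            status: /not/i.test(m[4]) ? 'Not sober' : 'Sober'
        });
    });

    return { people: people, review: review };
}
```

Calling parseLines(inputText) returns both the parsed people and the lines that looked relevant but did not fit the strict pattern, so nothing is dropped silently. When a new kind of dirty line shows up in review, you decide whether to relax STRICT or add an entry to FALSE_POSITIVES, instead of guessing at every possible error in advance.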
