Discard array if preg_match doesn't match pattern?

Question

I have a multidimensional array that looks like this:

Array
(
    [0] => Array
        (
            [0] => Title 1
            [1] => Some text ... US5801351017 ...
        )

    [1] => Array
        (
            [0] => Title 2
            [1] => Some text ... US0378331005 ...
        )

    [2] => Array
        (
            [0] => Title 3
            [1] => Some text ... //Note here that it does not contain an ISIN Code
        )
...

I am trying to keep only the sub-arrays whose text contains an ISIN code matched by my regex. The array above was produced by the following code:

$title = $html->find("h3.r a");
$titlearray = array_map(function($value){
    return trim($value->plaintext);
}, $title);

$description = $html->find("span.st");
$descriptionarray = array_map(function($value){
    $string = strip_tags($value);
    return $string;
}, $description);

$result1 = array();
foreach($titlearray as $key => $value) {
    $tmp = array($value);
    if (isset($descriptionarray[$key])) {
        $tmp[] = $descriptionarray[$key];
    }
    $result1[] = $tmp;
}

print_r($result1);

I wrote code that comes very close, but it does not actually unset the arrays that do not contain an ISIN. This is the code:

$title = $html->find("h3.r a");
$titlearray = array_map(function($value){
    return trim($value->plaintext);
}, $title);

$description = $html->find("span.st");
$descriptionarray = array_map(function($value){
    $match = array();
    $string = strip_tags($value);
    $pattern = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";
    preg_match($pattern, $string, $match);
    return $match;
}, $description);

$merged = array();
$i=0;
foreach($descriptionarray as $value){
  $merged[$i] = $value;
  $merged[$i][] = $titlearray[$i];
  $i++;
}

print_r($merged);

which gives me these arrays:

Array
(
    [0] => Array
        (
            [0] => US5801351017
            [1] => Title 1
        )

    [1] => Array
        (
            [0] => US0378331005
            [1] => Title 2
        )

    [2] => Array
        (
            [0] => Title 3
        )
...

How can I get rid of arrays that don't match my Regex? What I'm looking for is the output:

Array
(
    [0] => Array
        (
            [0] => Title 1
            [1] => US5801351017
        )

    [1] => Array
        (
            [0] => Title 2
            [1] => US0378331005
        )
...

EDIT

@CasimiretHippolyte Following his suggestion, I now have this code:

$titles = $html->find("h3.r a");

$descriptions = $html->find("span.st");

$ISIN_PATTERN = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";

$results = [];

foreach ($descriptions as $k => $v) {
    if (preg_match($ISIN_PATTERN, strip_tags($v), $m)) {
        $results[] = ['Title' => trim($titles[$k]->plaintext), 'ISIN' => $m[1]];
    }
}

print_r($results);

This narrows down my array by selecting only the elements that match the regex, but it doesn't display the matches in 'ISIN' => $m[1]. It outputs this:

Array
(
    [0] => Array
        (
            [Title] => Title 1
            [ISIN] => 
        )

    [1] => Array
        (
            [Title] => Title 2
            [ISIN] => 
        )
...

FURTHER EDIT

This code solves the problem:

$titles = $html->find("h3.r a");

$descriptions = $html->find("span.st");

$ISIN_PATTERN = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";

$results1 = [];

foreach ($descriptions as $k => $v) {
    if (preg_match($ISIN_PATTERN, strip_tags($v), $m)) {
        $results1[] = ['Title' => trim($titles[$k]->plaintext), 'ISIN' => $m[1]];
    }
}

$titlesarray = array_column($results1, 'Title');

$results2 = array_map(function($value){
    $match = array();
    $string = strip_tags($value);
    $pattern = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";
    preg_match($pattern, $string, $match);
    return $match;
}, $descriptions);

$descriptionarray = array_column($results2, 0);

$result3 = array();
foreach($titlesarray as $key => $value) {
    $tmp = array($value);
    if (isset($descriptionarray[$key])) {
        $tmp[] = $descriptionarray[$key];
    }
    $result3[] = $tmp;
}

print_r($result3);

I threw something together very quickly because I needed a fast solution. It is very inefficient: I use an additional array_map(), flatten the results into simple arrays and then join them back together. I am also repeating my regex.

LAST EDIT

@CasimiretHippolyte's answer is the most efficient solution, and it works both with his pattern using $m[1] and with my pattern using $m[0].
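The difference between the two indexes can be sketched as follows. The sample string and the shortened patterns are assumptions made for illustration, not the exact patterns from this thread. The last two lines also show that square brackets create a character class rather than an alternation, which is why a group such as (?:US|DE) is needed to test for whole country codes.

```php
<?php
// Hypothetical sample text modelled on the descriptions above.
$text = 'Some text ... US5801351017 ...';

// No capture group: only index 0 (the whole match) is filled.
preg_match('~\b[A-Z]{2}[A-Z0-9]{10}\b~', $text, $whole);
echo $whole[0], PHP_EOL;            // US5801351017
var_dump(isset($whole[1]));         // bool(false)

// One capture group: index 1 holds the text matched inside the parentheses.
preg_match('~\b([A-Z]{2}[A-Z0-9]{10})\b~', $text, $group);
echo $group[1], PHP_EOL;            // US5801351017

// Square brackets build a character class, not an alternation:
// [US|DE]{2} accepts any two of the characters U, S, |, D, E.
var_dump(preg_match('~^[US|DE]{2}$~', 'EU'));    // int(1)
var_dump(preg_match('~^(?:US|DE)$~', 'EU'));     // int(0)
```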


arrays php foreach regex

Ava Barbilla May 18 '15 at 11:24 PM

1 answer

Casimir et Hippolyte · Accepted Answer · 2015-05-19T01:18:30+0000

You can structure your code differently: use a simple foreach and build each result item only when an ISIN code is found:

$titles = $html->find("h3.r a");
$descriptions = $html->find("span.st");

define ('ISIN_PATTERN', '~
 \b  # there is probably a word boundary at the beginning of the ISIN code
 (?=([A-Z]{2}[A-Z0-9]{10})\b) # check the format before testing the whole alternation
                              # at the same time, the ISIN is captured in group 1
 (?: # so, this alternation is only here to make the pattern fail or succeed
     C[AHLNRSYZ]|I[DELNRST]|P[AEHKLT]|S[AEIGK]|A[ARTU]|B[EGMR]|L[BKUV]|M[OUXY]|T[HNRW]
     |E[EGS]|G[BGR]|H[KRU]|J[EOP]|K[RWY]|N[GLO]|D[EK]|F[IR]|R[OU]|U[AS]|V[EG]|XS|ZA
 )~x');

$results = [];

foreach ($descriptions as $k => $v) {
    if (preg_match(ISIN_PATTERN, strip_tags($v), $m))
        $results[] = [ 'ISIN' => $m[1], 'Title' => trim($titles[$k]->plaintext) ]; 
}

print_r($results);
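To try the filtering logic without the HTML parser, here is a minimal sketch that runs the same loop over plain arrays. The arrays $titles and $descriptions and the 'ZZ1234567890' entry are hypothetical sample data; the pattern is the one above written on two lines without comments.

```php
<?php
// Hypothetical stand-ins for the scraped titles and descriptions.
$titles = ['Title 1', 'Title 2', 'Title 3', 'Title 4'];
$descriptions = [
    'Some text ... US5801351017 ...',
    'Some text ... US0378331005 ...',
    'Some text without any code',
    'Some text ... ZZ1234567890 ...',   // right shape, unknown country code
];

// Same pattern as above, compacted: group 1 captures the ISIN inside the
// lookahead, the alternation then validates the two-letter prefix.
const ISIN_PATTERN = '~\b(?=([A-Z]{2}[A-Z0-9]{10})\b)(?:C[AHLNRSYZ]|I[DELNRST]|P[AEHKLT]'
    . '|S[AEIGK]|A[ARTU]|B[EGMR]|L[BKUV]|M[OUXY]|T[HNRW]|E[EGS]|G[BGR]|H[KRU]'
    . '|J[EOP]|K[RWY]|N[GLO]|D[EK]|F[IR]|R[OU]|U[AS]|V[EG]|XS|ZA)~';

$results = [];
foreach ($descriptions as $k => $text) {
    // Only descriptions containing a valid ISIN produce a result item.
    if (preg_match(ISIN_PATTERN, $text, $m)) {
        $results[] = ['ISIN' => $m[1], 'Title' => $titles[$k]];
    }
}

print_r($results);   // two items: Title 1 and Title 2
```

Because the ISIN is captured inside the lookahead, $m[0] only contains the two-letter country code here, while $m[1] contains the full twelve-character code.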

Note: this code is untested and can probably be improved. Some ideas:

stop using Simple HTML DOM and use DOMDocument and DOMXPath instead
the alternation in the pattern is written as if all countries were equally probable. If that is not the case, reorder it so that the most likely country codes are tested first
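The first idea could look roughly like the sketch below. The markup in $html is a hypothetical imitation of the structure implied by the selectors "h3.r a" and "span.st", the XPath expressions assume the class attributes contain exactly "r" and "st", and the pattern is shortened to the format check only. As in the code above, titles and descriptions are assumed to line up by position.

```php
<?php
// Hypothetical markup; the real page layout is an assumption.
$html = '<div><h3 class="r"><a href="#">Title 1</a></h3>'
      . '<span class="st">Some text ... <b>US5801351017</b> ...</span></div>'
      . '<div><h3 class="r"><a href="#">Title 3</a></h3>'
      . '<span class="st">Some text without a code</span></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$titles = $xpath->query('//h3[@class="r"]/a');
$descriptions = $xpath->query('//span[@class="st"]');

$results = [];
for ($i = 0; $i < $descriptions->length; $i++) {
    // textContent contains no tags, so strip_tags() is not needed.
    $text = $descriptions->item($i)->textContent;
    if (preg_match('~\b([A-Z]{2}[A-Z0-9]{10})\b~', $text, $m)) {
        $results[] = [
            'ISIN'  => $m[1],
            'Title' => trim($titles->item($i)->textContent),
        ];
    }
}

print_r($results);
```

This relies only on the DOM extension that ships with PHP, so no third-party parser is required.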