Modified PHP get_meta_tags doesn't work for some urls

I am trying to use code from the user-contributed notes on php.net for get_meta_tags. From what I can tell, if a meta tag is formatted as <meta content="foo" name="bar" />, the code skips it. Currently, only tags formatted as <meta name="bar" content="foo"/> are matched. I am not good with regex and have tried unsuccessfully to fix it. Below is an example url that seems to be slipping through the regex. Sorry in advance that my question is not strictly about get_meta_tags itself, but it seems like it might be related to some of the other issues people are having with this feature.

It seems the problem is somewhere here:

preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

      

which could be something like:

preg_match_all('/<[\s]*meta[\s]*(name|property|content)="?' . '([^>"]*)"?[\s]*' . '(content|name)="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

      

But then again, I'm pretty awful with regex. Any ideas?



1 answer


The idea is to capture the meta name / property inside a lookahead, so that the order of the attributes does not matter:

function extract_meta_tags($source)
{
  $pattern = '
  ~<\s*meta\s

  # using lookahead to capture type to $1
    (?=[^>]*?
    \b(?:name|property|itemprop|http-equiv)\s*=\s*
    (?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|
    ([^"\'>]*?)(?=\s*/?\s*>|\s\w+\s*=))
  )

  # capture content to $2
  [^>]*?\bcontent\s*=\s*
    (?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|
    ([^"\'>]*?)(?=\s*/?\s*>|\s\w+\s*=))
  [^>]*>

  ~ix';

  if(preg_match_all($pattern, $source, $out))
    return array_combine(array_map('strtolower', $out[1]), $out[2]);
  return array();
}

      

See the test at regex101. A branch reset group, (?|...), is used so that the value ends up in the same capture group whichever quote style is used.
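To illustrate what the branch reset does, here is a minimal sketch that compares an ordinary non-capturing group with a branch reset group on a single-quoted attribute. The two patterns and the variable names are made up for this illustration and are much simpler than the pattern in the function above.

```php
<?php
// Ordinary alternation: each alternative has its own group number.
$plain  = '/content=(?:"([^"]*)"|\'([^\']*)\')/';
// Branch reset (?|...): every alternative writes to group 1.
$branch = '/content=(?|"([^"]*)"|\'([^\']*)\')/';

$tag = "<meta name='bar' content='foo'>";

preg_match($plain, $tag, $m1);
preg_match($branch, $tag, $m2);

// $m1: group 1 is an empty string, the value sits in group 2.
// $m2: the value sits in group 1, and there is no group 2.
print_r($m1);
print_r($m2);
```

Without the branch reset, the calling code would have to check several groups to find the value; with it, the value can always be read from the same index.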

print_r(extract_meta_tags($str));

Try with some different data on eval.in
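As a stripped-down sketch of why the lookahead makes the attribute order irrelevant: the lookahead scans ahead for name without consuming anything, so the main part of the pattern can then find content wherever it sits in the tag. The pattern below is a simplified, hypothetical version that only handles double-quoted name and content attributes.

```php
<?php
// Simplified illustration: double-quoted values only, name + content only.
// (?=...) captures name into group 1 without moving the match position,
// then the rest of the pattern captures content into group 2.
$pattern = '~<meta\s(?=[^>]*?\bname="([^"]*)")[^>]*?\bcontent="([^"]*)"[^>]*>~i';

$html = '<meta name="bar" content="foo"/>' . "\n"
      . '<meta content="baz" name="qux" />';

preg_match_all($pattern, $html, $out);

// Map each name to its content, whichever attribute came first in the tag.
$tags = array_combine($out[1], $out[2]);
print_r($tags);
```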




Apply this to the <head> section of your HTML. To fetch a page and extract its meta tags:

1.) Get the source code using cURL, file_get_contents or fsockopen.

2.) Extract <head> using DOM or a regex like this: (?is)<head\b[^>]*>(.*?)</head>

3.) Extract the meta tags from <head> using the provided regex, or try a parser.
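For the parser route, the steps above could be sketched roughly as follows, assuming PHP's DOM extension is available. meta_tags_from_html() is a hypothetical helper name, the HTML string is sample data, and the fetch in step 1 is only shown as a comment. Note that this sketch reads every meta element in the document rather than limiting itself to <head>.

```php
<?php
// Hypothetical helper: collect meta tags with DOMDocument instead of a regex.
function meta_tags_from_html(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings internally.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Use the first identifying attribute found; attribute order is irrelevant.
        foreach (['name', 'property', 'itemprop', 'http-equiv'] as $attr) {
            if ($meta->hasAttribute($attr) && $meta->hasAttribute('content')) {
                $key = strtolower($meta->getAttribute($attr));
                $tags[$key] = trim($meta->getAttribute('content'));
                break;
            }
        }
    }
    return $tags;
}

// Step 1 (not run here): $html = file_get_contents('https://example.com/');
$html = '<html><head><meta content="foo" name="bar" />'
      . '<meta property="og:title" content="Hello"></head><body></body></html>';

$tags = meta_tags_from_html($html);
print_r($tags);
```

A parser copes with quoting styles and attribute order on its own, at the cost of loading the whole document.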
