Get the string passed as the first argument of a function call

I want to search PHP files for a particular function call. The reason is that I want to generate .mo files for the gettext extension, so I first need to create a .po file that contains all the required text strings.

I already find most of the strings, but there are some problems.

Here is my Regex to find the first argument of a function:

/\_\([\'|\"]{1}(.+?[^\\\])[\'|\"]{1}[,]{0,1}.*?\)+/si

      

I need to find function calls with the following patterns:

_("text");
_("text %s", 3);
_('text');

      

The text can contain escaped quotes. My problem is that I need to know whether the call used a single quote (apostrophe) or a double quote.

If I have a call like

_('"text"');

      

then the problem is that I get the text

"text

      

without an end quote.

Do any of you have an idea how I can get my Regex to work?



2 answers


I would use a PHP tokenizer for this kind of thing, not regular expressions:



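$funcName = '_';
$tokens   = token_get_all(file_get_contents('path/to/your/script.php'));
$strings  = array();

foreach ($tokens as $index => $token) {

  // single-character tokens such as "(" are plain strings; skip them
  if (!is_array($token))
    continue;

  if ($token[0] === T_CONSTANT_ENCAPSED_STRING) {

    // the string literal must come directly after "("
    if (!isset($tokens[$index - 2]) || ($tokens[$index - 1] !== "("))
      continue;

    // the token before "(" must be a full token (array), not a single character
    if (!is_array($tokens[$index - 2]))
      continue;

    list($id, $text, $line) = $tokens[$index - 2];

    // this is your string (substr drops the quotes around it);
    // $token[1][0] still tells you which quote character was used
    if (($id === T_STRING) && ($text === $funcName))
      $strings[] = substr($token[1], 1, -1);

  }
}

var_dump($strings);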
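Since the question also asks which quote character was used, here is a minimal sketch of the same tokenizer idea that records it. The function name extract_gettext_strings, the sample source and the array keys are my own assumptions for illustration, not part of the answer above. It also drops whitespace and comment tokens first, so that a call written as _( 'text' ) is still found.

```php
<?php
// Hypothetical helper: collect the first-argument string literals of calls
// to $funcName, together with the quote character and the line number.
function extract_gettext_strings(string $source, string $funcName = '_'): array
{
    // Remove whitespace and comments so the three tokens we care about
    // (name, "(", string literal) are always adjacent.
    $tokens = array_values(array_filter(
        token_get_all($source),
        function ($t) {
            return !is_array($t)
                || !in_array($t[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true);
        }
    ));

    $found = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue;
        }
        if ($i < 2 || $tokens[$i - 1] !== '(') {
            continue;
        }
        $prev = $tokens[$i - 2];
        if (!is_array($prev) || $prev[0] !== T_STRING || $prev[1] !== $funcName) {
            continue;
        }
        $found[] = [
            'quote' => $token[1][0],              // ' or "
            'text'  => substr($token[1], 1, -1),  // escapes are left untouched
            'line'  => $token[2],
        ];
    }
    return $found;
}

// Made-up sample input; the nowdoc keeps it free of extra escaping.
$source = <<<'PHP'
<?php
echo _("text");
echo _("text %s", 3);
echo _('"text"');
PHP;

$result = extract_gettext_strings($source);
print_r($result);
```

Note that double-quoted strings containing variables (for example "text $x") are not T_CONSTANT_ENCAPSED_STRING tokens, so this sketch skips them, just like the original loop does.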

      



Raw regex:

_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")

      

Delimited regex (with ~ as the delimiter):

~_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")~
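A quick way to try the delimited pattern is preg_match_all. This is only a sketch: the sample input and the variable names are made up, and reading the quote character from offset 2 of the full match relies on every match starting with the two characters _( .

```php
<?php
// The delimited pattern, kept in a nowdoc so that no backslash needs
// extra escaping for PHP's own string syntax.
$pattern = <<<'REGEX'
~_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")~
REGEX;

// Made-up sample input.
$code = <<<'PHP'
_("text");
_("text %s", 3);
_('"text"');
_('it\'s');
PHP;

preg_match_all($pattern, $code, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    $found[] = [
        'quote' => $m[0][2],  // the character right after "_("
        'text'  => $m[1],     // still contains escapes such as \'
    ];
}
print_r($found);
```

Because the pattern is not anchored to a word boundary, it would also match inside a longer name such as foo_('x'), which is one reason the tokenizer approach is more robust.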

      



The result is in capturing group 1. I used a branch reset group, (?|pattern), so that capture group numbering restarts at the same number in every alternative separated by |.

Inside the branch reset group (?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)") there are two alternatives:

  • '((?:[^'\\]|\\.)*)' : matches and captures the content of a single-quoted string, which is made up of characters that are neither a quote nor a backslash, or of escape sequences (a backslash followed by any character). Strictly speaking I'm a bit sloppy here, since a raw newline character is accepted as part of the string, but if the input contains valid code there should be no problem.

  • "((?:[^"\\]|\\.)*)" : same as above, but for a double-quoted string.
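To see what the branch reset group changes, compare it with a plain non-capturing group on a toy pattern. The patterns and the subject here are simplified stand-ins chosen for illustration, not the full regex above.

```php
<?php
$subject = '"b"';

// Plain alternation: each branch keeps its own group number,
// so the double-quoted branch lands in group 2.
preg_match('~(?:\'(a)\'|"(b)")~', $subject, $plain);

// Branch reset: both branches share group 1.
preg_match('~(?|\'(a)\'|"(b)")~', $subject, $reset);

var_dump($plain, $reset);
```

With the plain group you would have to check two different groups after every match; with the branch reset group the text is always in group 1, whichever quote style matched.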

Note that I do not match the rest of the function arguments.
