Extract Urdu / Arabic phrases or sentences from a string
I want to extract Urdu phrases from a user-submitted string in PHP. For this I tried the following test code:
$pattern = "#([\x{0600}-\x{06FF}]+\s*)+#u";
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
    print_r($matches);
} else {
    echo 'No matches.';
}
Now if, for example, $string contains
In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.
I am getting the following output:
Array ( [0] => Array ( [0] => دنیا گول ہے [1] => ہے ) [1] => Array ( [0] => آوارہ گرد کی ڈائری [1] => ڈائری ) [2] => Array ( [0] => ابن بطوطہ کے تعاقب میں [1] => میں ) )
Even though I get my desired matches (دنیا گول ہے, آوارہ گرد کی ڈائری and ابن بطوطہ کے تعاقب میں), I also get unwanted ones (ہے, ڈائری and میں), each of which is actually the last word of its phrase. Can anyone point out how I can avoid the unwanted matches?
This is because the capture group ([\x{0600}-\x{06FF}]+\s*) is matched multiple times, each time overwriting what it captured the previous time, so only the last word is left in it. You can get the expected result simply by changing it to a non-capturing group, (?:[\x{0600}-\x{06FF}]+\s*), but here's a more correct alternative:
$pattern = "#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u";
The first [\x{0600}-\x{06FF}]+ matches the first word, and then, if there is whitespace followed by another word, (?:\s+[\x{0600}-\x{06FF}]+)* matches it and any subsequent words. After the last word, however, it doesn't match trailing whitespace, which I suppose you don't need.
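To see the difference between the two fixes, here is a minimal sketch that runs both patterns on the sentence from the question. The variable names ($word, $loose, $strict, $short) and the second test string are made up for illustration. The patterns are written in single quotes so that PHP passes \x{...} and \s through to PCRE unchanged.

```php
<?php
// Sample input taken from the question.
$string = 'In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.';

// One "word": a run of characters from the Arabic block U+0600..U+06FF.
$word = '[\x{0600}-\x{06FF}]+';

// Fix 1: the original group made non-capturing. It can still swallow
// whitespace that follows the last word of a phrase.
$loose = '#(?:' . $word . '\s*)+#u';

// Fix 2: words joined by whitespace, with no trailing whitespace in the match.
$strict = '#' . $word . '(?:\s+' . $word . ')*#u';

// Without capture groups, $m[0] holds the full matches and nothing else.
preg_match_all($strict, $string, $m);
$phrases = $m[0];
print_r($phrases);

// Hypothetical second input where the phrase is followed by a space.
$short = 'دنیا گول ہے is a book';
preg_match_all($loose, $short, $looseMatches);
preg_match_all($strict, $short, $strictMatches);

var_dump($looseMatches[0][0]);  // phrase plus the trailing space
var_dump($strictMatches[0][0]); // phrase only
```

On the question's sentence both patterns return the same three phrases, because each phrase is followed by a comma or a parenthesis. They only differ when a phrase is followed by whitespace, as in the second input.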