Extract urdu / arabic phrases / sentences from string

Question

I want to extract Urdu phrases from a user-submitted string in PHP. To do this, I tried the following test code:

$pattern = "#([\x{0600}-\x{06FF}]+\s*)+#u";
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
    print_r($matches);
} else {
    echo 'No matches.';
}

Now if, for example, $string contains:

In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.

I am getting the following output:

Array
(
    [0] => Array
        (
            [0] => دنیا گول ہے
            [1] => ہے
        )

    [1] => Array
        (
            [0] => آوارہ گرد کی ڈائری
            [1] => ڈائری
        )

    [2] => Array
        (
            [0] => ابن بطوطہ کے تعاقب میں
            [1] => میں
        )

)

Even though I get my desired matches (دنیا گول ہے, آوارہ گرد کی ڈائری and ابن بطوطہ کے تعاقب میں), I also get unwanted ones (ہے, ڈائری and میں), each of which is the last word of its phrase. Can anyone point out how I can avoid the unwanted matches?

php regex

user165581 30 Aug At 12:02 pm

1 answer

Alan Moore · Accepted Answer · 2009-08-30T13:41:55+0000

This is because the capture group ([\x{0600}-\x{06FF}]+\s*) is matched multiple times, each time overwriting what it matched the previous time, so only the last word is left in it. You can get the expected result simply by changing it to a non-capturing group, (?:[\x{0600}-\x{06FF}]+\s*), but here's a more correct alternative:

$pattern = "#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u";

The first [\x{0600}-\x{06FF}]+ matches the first word, and then, if there is whitespace followed by another word, (?:\s+[\x{0600}-\x{06FF}]+)* matches it and any subsequent words. After the last word, though, it doesn't match any trailing whitespace, which I assume you don't want.
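To make the difference concrete, here is a minimal sketch that runs the stricter pattern against the sample string from the question and compares it with the simpler non-capturing variant on a short test string. The variable names ($phrases, $loose, $strict, $test) and the test string are my own, chosen only for illustration, and I have left out PREG_SET_ORDER so that $matches[0] is a flat list of whole matches.

```php
<?php
// Sample text from the question.
$string = 'In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.';

// Single quotes keep PHP from touching the backslash escapes,
// so PCRE sees \x{0600} and \s exactly as written.
$strict = '#[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*#u';
$loose  = '#(?:[\x{0600}-\x{06FF}]+\s*)+#u';

// Default flag is PREG_PATTERN_ORDER: $matches[0] holds every full match.
$phrases = [];
if (preg_match_all($strict, $string, $matches)) {
    $phrases = $matches[0];
}
print_r($phrases); // three phrases, no separate "last word" entries

// Hypothetical test string where a phrase is followed by a space.
$test = 'see کی ڈائری and more';
preg_match($loose, $test, $a);
preg_match($strict, $test, $b);

// The loose pattern swallows the space after the last word; the strict one does not.
var_dump($a[0] === 'کی ڈائری '); // bool(true)
var_dump($b[0] === 'کی ڈائری');  // bool(true)
```

Because neither pattern contains a capturing group, there is no index 1 in the result, which is what removes the unwanted last-word entries.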
