Splitting text into words in PHP: problem with decimal numbers

I am trying to break the text into words:

$delimiterList = array(" ", ".", "-", ",", ";", "_", ":",
           "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
           '"');
$words = mb_split($delimiterList, $string);

      

This works well for ordinary words, but I'm stuck on cases that involve numbers.

For example, given the text "Look at this. My score is 3.14 and I'm happy about it." the resulting array is:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3,
[7]=>14,
[8]=>and, ....

      

So 3.14 is also split into 3 and 14, which shouldn't happen in my case. I mean that the dot should separate two words, but not two numbers. The result should look like this:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3.14,
[7]=>and, ....

      

But I have no idea how to avoid these cases!

Does anyone know how to fix this problem?

Thanx, Granit

+2




4 answers


Or use regex :)



<?php
$str = "Look at this.My score is 3.14, and I am happy about it.";

// alternative to handle Marko example (updated)
// /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/

// -1 means "no limit"; passing null for $limit is deprecated as of PHP 8.1
var_dump(preg_split('/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/',
                    $str, -1, PREG_SPLIT_NO_EMPTY));

array(13) {
  [0]=>
  string(4) "Look"
  [1]=>
  string(2) "at"
  [2]=>
  string(4) "this"
  [3]=>
  string(2) "My"
  [4]=>
  string(5) "score"
  [5]=>
  string(2) "is"
  [6]=>
  string(4) "3.14"
  [7]=>
  string(3) "and"
  [8]=>
  string(1) "I"
  [9]=>
  string(2) "am"
  [10]=>
  string(5) "happy"
  [11]=>
  string(5) "about"
  [12]=>
  string(2) "it"
}

      

+9




Have a look at strtok. It lets you change the delimiter characters between calls, so you can split the string manually in a while loop, pushing each word into an array as you go.
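A minimal sketch of that idea, under some assumptions of mine: the function name `splitWithStrtok` is hypothetical, the delimiter set is the question's list with "." left out, and dots are handled afterwards with `trim`, `is_numeric` and `explode` rather than by switching delimiters mid-loop. It is one possible way to use `strtok` here, not necessarily what the answer had in mind.

```php
<?php
// Hypothetical helper (not from the original answer): tokenize with strtok(),
// leaving "." out of the delimiter set so that decimals such as 3.14 survive.
function splitWithStrtok(string $text): array
{
    // Every character in this string is treated as a separate delimiter.
    $delimiters = " -,;_:!?/()[]{}<>\r\n\"";
    $words = [];

    // The first call takes the string; later calls take only the delimiters.
    for ($token = strtok($text, $delimiters); $token !== false; $token = strtok($delimiters)) {
        $token = trim($token, '.');      // drop sentence-ending dots
        if ($token === '') {
            continue;
        }
        if (is_numeric($token)) {
            $words[] = $token;           // keep "3.14" in one piece
            continue;
        }
        // Non-numeric token: a remaining dot separates two words ("this.My").
        foreach (explode('.', $token) as $part) {
            if ($part !== '') {
                $words[] = $part;
            }
        }
    }
    return $words;
}
```

Calling `splitWithStrtok("Look at this. My score is 3.14 and I'm happy about it.")` keeps "3.14" as a single element. Note that a token such as "v1.2" is not numeric and would still be split at the dot.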



+6




My first idea was preg_match_all('/\w+/', $string, $matches); but this gives a result similar to what you already have. The problem is that a dot between digits is ambiguous: it can be a decimal point or the end of a sentence. So we need to transform the string in a way that removes the ambiguity.

For example, this sentence contains several parts that we want to keep as single words: "Look at this.My score is 3.14, and I am happy about it. It not 334,3 and today not 2009-12-12 11:12:13."

We start by creating a search => replace dictionary that encodes the exceptions as something that won't be split:

$encode = array(
    '/(\d+?)\.(\d+?)/' => '\\1DOT\\2',
    '/(\d+?),(\d+?)/' => '\\1COMMA\\2',
    '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/' => '\\1DASH\\2DASH\\3SPACE\\4COLON\\5COLON\\6'
);

      

Then we encode the exceptions:

foreach ($encode as $regex => $repl) {
    $string = preg_replace($regex, $repl, $string);
}

      

Split the string into words:

preg_match_all('/\w+/', $string, $matches);

      

And convert the encoded words back:

$decode = array(
    'search' =>  array('DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'),
    'replace' => array('.',   ',',     '-',    ' ',     ':'    )
);
foreach ($matches as $k => $v) {
    $matches[$k] = str_replace($decode['search'], $decode['replace'], $v);
}

      

$matches[0] now contains the original sentence split into words, with the exceptions kept intact.
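The encode, split and decode steps can be put together roughly as follows. This is a sketch under assumptions of mine: the function name `splitKeepingNumbers` is hypothetical, the date-time pattern is applied first, greedy `\d+` replaces the lazy `\d+?`, and `${1}` is used instead of `\\1` for the backreferences. The placeholder words are still plain text, so an input that already contains "DOT", "COMMA" and so on in upper case would be altered on decoding.

```php
<?php
// Hypothetical wrapper around the steps described above.
function splitKeepingNumbers(string $string): array
{
    // Encode separators that sit between digits as placeholder words.
    $encode = [
        '/(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/'
            => '${1}DASH${2}DASH${3}SPACE${4}COLON${5}COLON${6}',
        '/(\d+)\.(\d+)/' => '${1}DOT${2}',
        '/(\d+),(\d+)/'  => '${1}COMMA${2}',
    ];
    foreach ($encode as $regex => $replacement) {
        $string = preg_replace($regex, $replacement, $string);
    }

    // Split on anything that is not a word character.
    preg_match_all('/\w+/', $string, $matches);

    // Decode the placeholders; $matches[0] holds the full matches.
    return str_replace(
        ['DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'],
        ['.',   ',',     '-',    ' ',     ':'],
        $matches[0]
    );
}
```

For the example sentence above, the returned array keeps "3.14", "334,3" and "2009-12-12 11:12:13" as single elements.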

You can make the regexes used for the exceptions as simple or as complex as you like, but some ambiguity will always remain, for example two sentences where the first ends with a number and the next one starts with a number: Number of the counting shall be 3.3 only and nothing but the 3.5 is right out.

+1




Use ". " instead of "." in $delimiterList.
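A sketch of how that could look, with assumptions of mine stated up front: the function name `splitOnDelimiterList` is hypothetical, and because `preg_split` (like `mb_split`) takes a pattern string rather than an array, the list is escaped with `preg_quote` and joined into one alternation. A multi-character delimiter such as ". " only works with a pattern-based split like this.

```php
<?php
// Hypothetical helper: split on a list of delimiters, where ". " (dot plus
// space) replaces the bare "." so that "3.14" is left alone.
function splitOnDelimiterList(string $string): array
{
    $delimiterList = [" ", ". ", "-", ",", ";", "_", ":",
        "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
        '"'];

    // Escape each delimiter and join them into a single alternation.
    $escaped = array_map(
        function (string $d): string {
            return preg_quote($d, '/');
        },
        $delimiterList
    );
    $pattern = '/(?:' . implode('|', $escaped) . ')/u';

    return preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);
}
```

This approach has limits: a dot with no space after it is never treated as a delimiter, so "this.My" stays in one piece and the last word of the text keeps its trailing dot ("it.") unless you strip it afterwards, for example with `rtrim($word, '.')`.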

0








