How to identify a similar string through keywords

keywords: all words with more than 3 characters
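
As a rough illustration of this definition, the sketch below splits a string on whitespace and keeps only the words longer than 3 characters. The helper name `extract_keywords` and its `$minLength` parameter are hypothetical, not part of the original question. It assumes punctuation stays attached to words, as in the examples further down, and it measures length in bytes with `strlen`.

```php
<?php
// Hypothetical helper: split on whitespace and keep the words that are
// longer than $minLength bytes (strlen counts bytes, not characters).
function extract_keywords(string $text, int $minLength = 3): array
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $keywords = array_filter($words, function ($word) use ($minLength) {
        return strlen($word) > $minLength;
    });

    // Reindex so the result is a plain list starting at 0.
    return array_values($keywords);
}

print_r(extract_keywords('petter hello how are... !!'));
// keeps 'petter', 'hello' and 'are...'; drops 'how' and '!!'
```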

I want to compare the keywords of two strings under these conditions:

  • Word order doesn't matter (example 1)
  • Words with 3 characters or fewer are not counted (example 2)
  • The shorter sentence (by number of characters) goes in str1 (example 3)
  • I only need the words in str1 that are missing from str2 (example 4)

In practice, I have a bot that crawls two news sites daily and copies the news into my database. I then need an algorithm that compares news titles and identifies duplicates. (The same story usually gets a different headline on each news website, but the headlines often share the same keywords.)

example1: Word order doesn't matter.

str1= 'hello petter'
str2= 'petter hello'

result: 0 

      

example2: Words with no more than 3 characters are not evaluated

str1= 'hello !!'
str2= 'petter hello'

result: 0 // '!!' has fewer than 3 characters, so str1 is effectively 'hello'; the result is 0

      

OR

str1= 'hello petter how are u?'
str2= 'petter hello how are you'

result: 0 // str1 is 'hello petter how are'

      

example3: The variables must be swapped

str1= 'hello petter how are you ?'
str2= 'petter hello how are you?'
// Then
str1= 'hello petter how are you?'
str2= 'petter hello how are you ?'

result: 1 // 1 is for 'you?' (in str1)

      

example4: Extra words in str2 don't matter

str1= 'hello petter how are you?'
str2= 'petter hello how are you ?'

result: 1 // str2 is 'petter hello how are you', then 1 is for: 'you?' (in str1)

      

Note: "you" (in str2) is not important to me, because it doesn't match with any words in str1.

example: (for more information)

str1= 'petter hello how are you pal?'
str2= 'petter hello how are... !!'

// First, swap str1 and str2
str1= 'petter hello how are... !!'
str2= 'petter hello how are you pal?'

// Then remove '!!' (in str1)
str1= 'petter hello how are...'
str2= 'petter hello how are you pal?'

result: 1 // 1 for 'are...' (in str1) - ['are','you','pal?' do not matter (in str2)]
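
The steps of this walk-through can be sketched roughly as follows. The function name `count_missing_keywords` is hypothetical, and the sketch assumes the stated rule is applied strictly, so every word of str1 with 3 characters or fewer is dropped (not only '!!'). It uses `array_diff` to find the remaining words of str1 that never occur in str2.

```php
<?php
// Hypothetical helper following the three steps of the walk-through.
function count_missing_keywords(string $a, string $b): int
{
    // Step 1: put the shorter string (by byte length) in $str1.
    [$str1, $str2] = strlen($a) <= strlen($b) ? [$a, $b] : [$b, $a];

    $words1 = preg_split('/\s+/', trim($str1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split('/\s+/', trim($str2), -1, PREG_SPLIT_NO_EMPTY);

    // Step 2: drop the words of $str1 that have 3 characters or fewer.
    $keywords = array_filter($words1, function ($word) {
        return strlen($word) > 3;
    });

    // Step 3: count the remaining words that never occur in $str2.
    return count(array_diff($keywords, $words2));
}

// The walk-through input: only 'are...' is missing from the longer string.
echo count_missing_keywords('petter hello how are you pal?', 'petter hello how are... !!'); // 1
```

Because punctuation is not stripped, 'you?' and 'you' are treated as different words, which matches the results given in examples 3 and 4.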

      

Finally, I need a function to identify duplicate news through the result and the number of keywords (all words with more than 3 characters).

$keywords_numb = 7;
$result = 2;

function identify_duplicate($keywords_numb, $result){
    if ($keywords_numb / 3 >= $result) {
        $Specified = 'this is a new news';
    }
    else {
        $Specified = 'this is a duplicate news';
    }
    return $Specified;
}

// Call the function and print its return value
echo identify_duplicate($keywords_numb, $result);

      

output:

this is a new news

      

Does anyone know how I can write this program? Regards


2 answers


With the help of @karthik manchala I did it ...

$str1 = 'this news is about a player named Ronaldo';
$str2 = 'The player who called Ronaldo';

function identify_duplicate($str1, $str2){
    // Keep the shorter string in $str1
    if (strlen($str1) > strlen($str2)) {
        list($str1, $str2) = array($str2, $str1); // swap the two variables
    }

    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);

    $words_numb = sizeof($str1);
    $result = $words_numb;

    // Subtract every word that also occurs in $str2 or has 3 characters or fewer
    foreach ($str1 as $val) {
        if (in_array($val, $str2) || strlen($val) <= 3) {
            $result--;
        }
    }

    // Few unmatched words relative to the word count means a duplicate
    if ($words_numb / 3 >= $result) {
        $Specified = 'this is a duplicate news';
    }
    else {
        $Specified = 'this is a new news';
    }
    return $Specified;
}

$out = identify_duplicate($str1, $str2);
echo $out;

      



Output:

this is a duplicate news



You don't need regex for this. You can use the following function and pass the strings in any order:



function identify_duplicate($var1, $var2){
    // Put the shorter string in $str1, as the question requires
    if (strlen($var1) <= strlen($var2)) {
        $str1 = $var1;
        $str2 = $var2;
    }
    else {
        $str1 = $var2;
        $str2 = $var1;
    }
    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);

    $return = sizeof($str1);

    // Subtract every word found in $str2 or with 3 characters or fewer
    foreach ($str1 as $val) {
        if (in_array($val, $str2) || strlen($val) <= 3) {
            $return = $return - 1;
        }
    }

    return $return;
}
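
For the use case in the question (filtering crawled headlines), a counting function like this could be applied to a batch of titles roughly as sketched below. The names `compare_titles` and `filter_duplicate_titles` and the sample headlines are hypothetical. The sketch assumes the shorter title is the one whose words are checked, and it borrows the "word count / 3 >= unmatched words" rule from the other answer; that threshold is an assumption to tune, not a tested value.

```php
<?php
// Returns [unmatched, total]: how many words of the shorter title are longer
// than 3 characters and missing from the longer title, plus the shorter
// title's total word count.
function compare_titles(string $a, string $b): array
{
    if (strlen($a) > strlen($b)) {
        [$a, $b] = [$b, $a]; // keep the shorter title in $a
    }
    $wordsA = preg_split('/\s+/', trim($a), -1, PREG_SPLIT_NO_EMPTY);
    $wordsB = preg_split('/\s+/', trim($b), -1, PREG_SPLIT_NO_EMPTY);

    $unmatched = 0;
    foreach ($wordsA as $word) {
        if (strlen($word) > 3 && !in_array($word, $wordsB, true)) {
            $unmatched++;
        }
    }
    return [$unmatched, count($wordsA)];
}

// Keeps a title only if it is not a duplicate of a title already kept.
// Assumed duplicate rule: total / 3 >= unmatched.
function filter_duplicate_titles(array $titles): array
{
    $kept = [];
    foreach ($titles as $title) {
        $isDuplicate = false;
        foreach ($kept as $seen) {
            [$unmatched, $total] = compare_titles($title, $seen);
            if ($total / 3 >= $unmatched) {
                $isDuplicate = true;
                break;
            }
        }
        if (!$isDuplicate) {
            $kept[] = $title;
        }
    }
    return $kept;
}

$titles = [
    'this news is about a player named Ronaldo',
    'The player who called Ronaldo',
    'Stock markets fall sharply after rate decision',
];
print_r(filter_duplicate_titles($titles)); // keeps the first and third titles
```

Each new title is compared against every title already kept, so the cost grows quadratically with the batch size; for two sites crawled daily that is likely acceptable.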

      
