How to identify a similar string through keywords
keywords: all words with more than 3 characters
I want to compare the keywords of two strings under these conditions:
- Word order doesn't matter (see example 1).
- Words with 3 characters or fewer are not counted (see example 2).
- The shorter string (by number of characters) must go in str1 (see example 3).
- I only need the words in str1 that are missing from str2 (see example 4).
In fact, I have a bot that crawls two news sites daily and copies the news into my database. I then need an algorithm to compare news titles and identify duplicate news. (As you know, different news websites give the same story different headlines, but those headlines often share the same keywords.)
example1: Word order doesn't matter.
str1= 'hello petter'
str2= 'petter hello'
result: 0
example2: Words with 3 characters or fewer are not evaluated
str1= 'hello !!'
str2= 'petter hello'
result: 0 // '!!' has fewer than 3 characters, so str1 is treated as 'hello' and the result is 0
OR
str1= 'hello petter how are u?'
str2= 'petter hello how are you'
result: 0 // str1 is 'hello petter how are'
example3: The variables must be swapped so that the shorter string is in str1
str1= 'hello petter how are you ?'
str2= 'petter hello how are you?'
// After swapping
str1= 'hello petter how are you?'
str2= 'petter hello how are you ?'
result: 1 // 1 is for 'you?' (in str1)
example4: Extra words in str2 are not important
str1= 'hello petter how are you?'
str2= 'petter hello how are you ?'
result: 1 // str2 is 'petter hello how are you', so the 1 is for 'you?' (in str1)
Note: "you" (in str2) is not important to me, because it doesn't match any word in str1.
Another example (for more information):
str1= 'petter hello how are you pal?'
str2= 'petter hello how are... !!'
// First, swap str1 and str2
str1= 'petter hello how are... !!'
str2= 'petter hello how are you pal?'
// Then remove '!!' (in str1)
str1= 'petter hello how are...'
str2= 'petter hello how are you pal?'
result: 1 // 1 is for 'are...' (in str1); 'are', 'you' and 'pal?' in str2 don't matter
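The conditions and examples above can be sketched roughly as follows. This is only a minimal sketch: the function name count_missing_keywords is hypothetical, words are split on whitespace only (so 'you?' and 'you' stay different, as in the examples), and strlen counts bytes, which matches characters only for single-byte text.

```php
<?php
// Hypothetical helper: counts the keywords of the shorter string
// that do not appear in the longer string.
function count_missing_keywords(string $a, string $b): int
{
    // Condition 3: the shorter string (by character count) plays the role of str1.
    if (strlen($a) > strlen($b)) {
        [$a, $b] = [$b, $a];
    }

    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words1 = preg_split('/\s+/', $a, -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split('/\s+/', $b, -1, PREG_SPLIT_NO_EMPTY);

    $missing = 0;
    foreach ($words1 as $word) {
        // Condition 2: only words longer than 3 characters are keywords.
        // Conditions 1 and 4: position is ignored, and extra words in str2 are never counted.
        if (strlen($word) > 3 && !in_array($word, $words2, true)) {
            $missing++;
        }
    }
    return $missing;
}

echo count_missing_keywords('hello petter', 'petter hello'), "\n";                               // 0
echo count_missing_keywords('hello !!', 'petter hello'), "\n";                                   // 0
echo count_missing_keywords('hello petter how are you?', 'petter hello how are you ?'), "\n";    // 1
echo count_missing_keywords('petter hello how are you pal?', 'petter hello how are... !!'), "\n"; // 1
```

The four calls correspond to examples 1, 2, 3/4 and the last example, and give the same results as listed there.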
Finally, I need a function that identifies duplicate news from the result and the number of keywords (all words with more than 3 characters).
$keywords_numb = 7;
$result = 2;
function identify_duplicate($keywords_numb, $result){
    if($keywords_numb / 3 >= $result){
        $Specified = 'this is a new news';
    }
    else $Specified = 'this is a duplicate news';
    return $Specified;
}
// $Specified only exists inside the function, so print the return value instead
echo identify_duplicate($keywords_numb, $result);
output:
this is a new news
Does anyone know how I can write this program? Regards
With the help of @karthik manchala, I got it working:
$str1 = 'this news is about a player named Ronaldo';
$str2 = 'The player who called Ronaldo';
function identify_duplicate($str1, $str2){
    // make sure $str1 holds the shorter string
    if(strlen($str1) > strlen($str2)){
        list($str1, $str2) = array($str2, $str1); // swap two variables
    }
    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);
    $words_numb = sizeof($str1);
    $result = $words_numb;
    // discount every word that also appears in $str2 or has 3 characters or fewer
    foreach($str1 as $val){
        if(in_array($val, $str2) || strlen($val) <= 3){
            $result--;
        }
    }
    // few unmatched words relative to the word count means a duplicate
    if($words_numb / 3 >= $result){
        $Specified = 'this is a duplicate news';
    }
    else $Specified = 'this is a new news';
    return $Specified;
}
$out = identify_duplicate($str1, $str2);
echo $out;
Output:
this is a duplicate news
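Real headlines often differ in case and punctuation ('Ronaldo,' versus 'ronaldo'), which the exact word comparison above treats as different words. A possible variant, sketched under my own assumptions, lowercases each title and splits on non-alphanumeric characters before comparing. The names title_keywords and is_duplicate_title are hypothetical, the one-third threshold is carried over from the code above, and picking the shorter title by keyword count rather than by character count is a deliberate change. Note that strtolower and strlen work on bytes, so non-ASCII titles would need the mbstring functions instead.

```php
<?php
// Hypothetical helper: lowercase a title and return its distinct keywords
// (words longer than 3 characters).
function title_keywords(string $title): array
{
    // Split on anything that is not a letter or digit,
    // so "Ronaldo," and "ronaldo" end up as the same word.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    $keywords = array_filter($words, fn($w) => strlen($w) > 3);
    return array_values(array_unique($keywords));
}

// Hypothetical variant of identify_duplicate(): true when at most a third of the
// keywords of the title with fewer keywords are missing from the other title.
function is_duplicate_title(string $a, string $b): bool
{
    $ka = title_keywords($a);
    $kb = title_keywords($b);
    if (count($ka) > count($kb)) {
        [$ka, $kb] = [$kb, $ka]; // keep the smaller keyword set in $ka
    }
    if (count($ka) === 0) {
        return false; // no keywords, nothing to compare
    }
    $missing = count(array_diff($ka, $kb));
    return $missing <= count($ka) / 3;
}

var_dump(is_duplicate_title(
    'this news is about a player named Ronaldo',
    'The player who called Ronaldo'
)); // bool(true)
var_dump(is_duplicate_title(
    'Stock markets fall sharply',
    'Ronaldo signs new contract'
)); // bool(false)
```

Returning a boolean instead of a message string keeps the decision separate from how it is reported.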
You don't need regex for this. You can use the following function and pass the strings in any order:
function identify_duplicate($var1, $var2){
    // $str1 always receives the longer string (or the first one if lengths are equal)
    if(strlen($var1) >= strlen($var2)){
        $str1 = $var1;
        $str2 = $var2;
    }
    else{
        $str1 = $var2;
        $str2 = $var1;
    }
    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);
    // start from the word count of $str1 and subtract every word that
    // also appears in $str2 or has 3 characters or fewer
    $return = sizeof($str1);
    foreach($str1 as $val){
        if(in_array($val, $str2) || strlen($val) <= 3){
            $return = $return - 1;
        }
    }
    return $return;
}
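As a quick check, the sketch below repeats this function unchanged and runs it on sample strings taken from elsewhere in this thread. One thing it illustrates: because the loop walks the longer string, the count can differ from what the question describes. For the two Ronaldo headlines it returns 4, whereas walking the shorter headline would leave 1.

```php
<?php
// Same logic as the function above, repeated so the example runs on its own.
function identify_duplicate($var1, $var2){
    if(strlen($var1) >= strlen($var2)){
        $str1 = $var1;
        $str2 = $var2;
    }
    else{
        $str1 = $var2;
        $str2 = $var1;
    }
    $str1 = explode(" ", $str1);
    $str2 = explode(" ", $str2);
    $return = sizeof($str1);
    foreach($str1 as $val){
        if(in_array($val, $str2) || strlen($val) <= 3){
            $return = $return - 1;
        }
    }
    return $return;
}

// Argument order does not matter: both calls print 4
// ('this', 'news', 'about' and 'named' are unmatched in the longer headline).
echo identify_duplicate('this news is about a player named Ronaldo', 'The player who called Ronaldo'), "\n";
echo identify_duplicate('The player who called Ronaldo', 'this news is about a player named Ronaldo'), "\n";

// Last example from the question: prints 1 (only 'pal?' is unmatched).
echo identify_duplicate('petter hello how are you pal?', 'petter hello how are... !!'), "\n";
```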