Which R function to use to auto-correct text?
I have a two column csv document that contains a product category and a product name.
Example:
Sl.No.  Commodity Category  Commodity Name
1       Stationary          Pencil
2       Stationary          Pen
3       Stationary          Marker
4       Office Utensils     Chair
5       Office Utensils     Drawer
6       Hardware            Monitor
7       Hardware            CPU
I also have another CSV file that contains various product names.
Example:
Sl.No.  Commodity Name
1       Pancil
2       Pencil-HB 02
3       Pencil-Apsara
4       Pancil-Nataraj
5       Pen-Parker
6       Pen-Reynolds
7       Monitor-X001RL
The output I would like standardizes the product names and assigns each one to its product category, as shown below:
Sl.No.  Commodity Name  Commodity Category
1       Pencil          Stationary
2       Pencil          Stationary
3       Pencil          Stationary
4       Pancil          Stationary
5       Pen             Stationary
6       Pen             Stationary
7       Monitor         Hardware
Step 1) First I need to use NLTK (text mining techniques) to clean the data and separate "Pencil" from "Pencil-HB 02".
Step 2) After cleaning, I have to use approximate string matching, i.e. agrep(), to match the "Pencil" patterns or correct "Pancil" to "Pencil".
Step 3) After correcting the names, I have to classify them. I do not know how.
This is what I was thinking. I started with step 2 and that is where I am stuck: I can't find an exact method to code this. Is there a way to get the result as needed? If so, please suggest a method for me to proceed with.
You can use the stringdist package. The function correct below replaces each Commodity.Name in file2 with the closest name in CName, based on string distance.
A dplyr join (inner_join in the first example, left_join in the second) then combines the two tables.
I also noticed some misclassifications when using the default options of stringdistmatrix. You can try changing its weight argument for a better correction result.
> library(dplyr)
> library(stringdist)
>
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
>
> head(file1)
  Sl.No. Commodity.Category Commodity.Name
1      1         Stationary         Pencil
2      2         Stationary            Pen
3      3         Stationary         Marker
4      4    Office Utensils          Chair
5      5    Office Utensils         Drawer
6      6           Hardware        Monitor
> head(file2)
  Sl.No. Commodity.Name
1      1         Pancil
2      2   Pencil-HB 02
3      3  Pencil-Apsara
4      4 Pancil-Nataraj
5      5     Pen-Parker
6      6   Pen-Reynolds
>
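> # unique() works whether read.csv returned factors (R < 4.0) or character columns (R >= 4.0)
> CName <- unique(as.character(file1$Commodity.Name))
> correct <- function(x){
+   # for each name, pick the reference name with the smallest weighted distance
+   factor(sapply(as.character(x), function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))], USE.NAMES = FALSE), CName)
+ }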
>
> correctedfile2 <- file2 %>%
+   transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> correctedfile2 %>%
+   inner_join(file1[,-1], by="Commodity.Name")
  Commodity.Name.Old Commodity.Name Commodity.Category
1             Pancil         Pencil         Stationary
2       Pencil-HB 02         Pencil         Stationary
3      Pencil-Apsara         Pencil         Stationary
4     Pancil-Nataraj         Pencil         Stationary
5         Pen-Parker            Pen         Stationary
6       Pen-Reynolds            Pen         Stationary
7     Monitor-X001RL        Monitor           Hardware
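The cleaning and matching steps from the question (Steps 1 and 2) can also be sketched in base R alone. This is a minimal sketch, not the method used in the code above: it assumes the brand or model suffix always follows the first hyphen, swaps in utils::adist (plain Levenshtein distance) for stringdistmatrix, and uses hypothetical sample vectors named reference and raw.

```r
# Hypothetical sample data mirroring the two files in the question
reference <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
raw <- c("Pancil", "Pencil-HB 02", "Pencil-Apsara", "Pancil-Nataraj",
         "Pen-Parker", "Pen-Reynolds", "Monitor-X001RL")

# Step 1: keep only the text before the first hyphen (assumed suffix separator)
stem <- trimws(sub("-.*$", "", raw))

# Step 2: edit-distance matrix, one row per stem and one column per reference name
d <- adist(stem, reference)

# For each stem, take the reference name with the smallest distance
corrected <- reference[apply(d, 1, which.min)]

print(data.frame(raw, stem, corrected))
```

Stripping the suffix first means the distance only has to absorb real typos such as "Pancil", so default weights may be enough for data shaped like this.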
If you want an "Others" category, you just have to play with the weights. I added the row "Diesel" to file2. The scores are then computed with stringdistmatrix using custom weights (you should try changing the values). If the best score is greater than 2 (this threshold depends on how the weights are assigned), the name is left uncorrected.
PS: since we do not know all possible labels in advance, we have to use as.character to convert the factor to character.
PS2: I also use tolower for case insensitivity.
> head(file2)
  Sl.No. Commodity.Name
1      1         Diesel
2      2         Pancil
3      3   Pencil-HB 02
4      4  Pencil-Apsara
5      5 Pancil-Nataraj
6      6     Pen-Parker
>
> # unique() works whether read.csv returned factors (R < 4.0) or character columns (R >= 4.0)
> CName <- unique(as.character(file1$Commodity.Name))
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+   scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+   if (min(scores)>2) {
+     # no reference name is close enough: keep the original name
+     return(x)
+   } else {
+     return(as.character(CName[which.min(scores)]))
+   }
+ }
> correct <- function(x) {
+   sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+   transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
>
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+   left_join(file1[,-1], by="Commodity.Name")
  Commodity.Name.Old Commodity.Name Commodity.Category
1             Diesel         Diesel               <NA>
2             Pancil         Pencil         Stationary
3       Pencil-HB 02         Pencil         Stationary
4      Pencil-Apsara         Pencil         Stationary
5     Pancil-Nataraj         Pencil         Stationary
6         Pen-Parker            Pen         Stationary
7       Pen-Reynolds            Pen         Stationary
8     Monitor-X001RL        Monitor           Hardware
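The classification step (Step 3 in the question) amounts to looking up each corrected name in the reference table. A minimal base R sketch, assuming a small hypothetical reference data frame and an already corrected name vector; the "Others" label for unmatched names is an illustrative choice, not something the join above produces:

```r
# Hypothetical reference table (a subset of file1 from the question)
file1 <- data.frame(
  Commodity.Category = c("Stationary", "Stationary", "Hardware"),
  Commodity.Name     = c("Pencil", "Pen", "Monitor"),
  stringsAsFactors   = FALSE
)

# Hypothetical output of the correction step
corrected <- c("Diesel", "Pencil", "Pen", "Monitor")

# match() returns the row of file1 for each name, or NA when there is none
category <- file1$Commodity.Category[match(corrected, file1$Commodity.Name)]

# Unmatched names fall into a catch-all category (assumed label)
category[is.na(category)] <- "Others"

result <- data.frame(Commodity.Name = corrected,
                     Commodity.Category = category,
                     stringsAsFactors = FALSE)
print(result)
```

This gives the same categories as the left_join for matched names, with the NA rows relabelled.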
The {stringdist} package (at least in version 0.9.4.6) has an approximate string matching function, amatch(), that returns the position of the closest match in a given set of strings. Its maxDist parameter sets the maximum distance allowed for a match, and its nomatch parameter sets the value returned when nothing matches, which can be used for the "Others" category. Otherwise, method, weight, etc. can be set in the same way as for stringdistmatrix().
So your original problem can be solved with a tidyverse compatible solution:
library(dplyr)
library(stringdist)
# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")
# Getting the commodity names in a vector
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()
# Finding the closest string match of the commodities, and joining the file containing the categories
file2 %>%
  mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
  left_join(file1, by = "Commodity Name")
This returns a data frame containing the corrected commodity name and its category. If the original Commodity Name is more than 5 edits away (a simplified explanation of string distance) from every known commodity name, the corrected name is NA.
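To make the effect of maxDist concrete, here is a small sketch with hypothetical input vectors. It assumes the stringdist package is installed, uses amatch() with its default method, and relabels unmatched names as "Others" (an illustrative label). The tighter maxDist = 2 is also an assumption, chosen so that an unrelated name such as "Diesel" is not forced onto a commodity.

```r
library(stringdist)

# Hypothetical reference names and raw inputs
commodities <- c("Pencil", "Pen", "Marker", "Chair", "Drawer", "Monitor", "CPU")
raw <- c("Pancil", "Diesel", "Monitr")

# Position of the closest commodity within distance 2, NA if none is that close
idx <- amatch(raw, commodities, maxDist = 2)

# Replace unmatched names with a catch-all label
corrected <- ifelse(is.na(idx), "Others", commodities[idx])

print(corrected)
# "Pencil" "Others" "Monitor"
```

A smaller maxDist trades recall for precision: long suffixes such as "-X001RL" would need either a larger maxDist or a cleaning step before matching.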