Multiple results from one variable when using the tm "stemCompletion" method

Question

I have a corpus containing log data from 15 observations of three variables (ID, title, abstract). Using RStudio, I am reading the data from a CSV file (one line per observation). While doing some text mining, I ran into trouble with the stemCompletion method: after applying it, the result for each CSV line is shown three times. All other tm methods (for example stemDocument) produce only one result. I am wondering why this happens and how I can solve it.

I used the following code:

data.corpus <- Corpus(DataframeSource(data))  
data.corpuscopy <- data.corpus
data.corpus <- tm_map(data.corpus, stemDocument)
data.corpus <- tm_map(data.corpus, stemCompletion, dictionary=data.corpuscopy)

After applying stemDocument, there is only one result per document, for example:

"> data.corpus[[1]]

physic environ   sourc  innov investig  attribut  innov space
          investig  physic space intersect  innov  innov     relev attribut  physic space   innov        reflect  chang natur  innov  technolog advanc  servic  mean chang  argu   develop  innov space similar embodi  divers set  valu   collabor open  sustain use  literatur review interview  benchmark    examin  relationship  physic environ  innov         literatur review   interview underlin innov   communic  human centr process   result five attribut  innov space  present collabor enabl modifi smart attract   reflect       provid perspect   challeng    support innov creation  develop physic space   add   conceptu develop  innov space  outlin physic space   innov servic"

And after using stemCompletion, the results appear three times:

"$`1`
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service"

Below is a sample as a reproducible example:

CSV file containing three observations of three variables:

ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations

And below is the code I used:

data = read.csv2("Test.csv")
data[,2]=as.character(data[,2])
data[,3]=as.character(data[,3])

corpus <- Corpus(DataframeSource(data)) 
corpuscopy <- corpus
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]

corpus <- tm_map(corpus, stemCompletion, dictionary=corpuscopy)
inspect(corpus[1:3])

It seems to me that it depends on the number of variables used in the .csv, but I have no idea why.

r rstudio tm stemming

Dobby 05 oct. 14 at 16:23

1 answer

Ben · Accepted Answer · 2014-11-02T05:55:36+0000

There seems to be something weird about the stemCompletion function. It is not clear how to use stemCompletion in tm version 0.6. There is a nice workaround, which I used for this answer.
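For orientation, stemCompletion is documented as taking a character vector of stems plus a dictionary (a corpus or a character vector), which is why the workaround further down splits each document into single words first. A minimal sketch of that basic call, using made-up stems and dictionary words rather than the data from the question:

```r
library(tm)

# Illustrative inputs (assumptions, not taken from the question's data)
stems <- c("innov", "knowledg", "manag")
dictionary_words <- c("innovation", "knowledge", "management", "manager")

# Each stem is matched against the dictionary words that start with it;
# type = "shortest" keeps the shortest matching word.
completed <- stemCompletion(stems, dictionary = dictionary_words, type = "shortest")
print(completed)
```

The result is a character vector named by the input stems, so it has one entry per stem, not one per document.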

First, create the CSV file that you have:

dat <- read.csv2( text = 
                  "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")

write.csv2(dat, "Test.csv", row.names = FALSE)

Read it back in, convert it to a corpus and stem the words:

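library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument

data <- read.csv2("Test.csv")
# Make sure both text columns are character, not factor
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
corpus <- tm_map(corpus, stemDocument)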

Check that the stemming worked:

inspect(corpus)

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
1
Below is the first titl
Innovat and Knowledg Manag

[[2]]
<<PlainTextDocument (metadata: 7)>>
2
And now the second Titl
Organiz Perform and Learn are veri import

[[3]]
<<PlainTextDocument (metadata: 7)>>
3
The third titl
Knowledg play an import rule in organ

Here's a nice workaround to get stemCompletion working:

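stemCompletion_mod <- function(x, dict = corpuscopy) {
  # Split the document into single stems, complete each one, then glue them back together
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}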

Inspect the output to see if the stems were completed ok:

lapply(corpus, stemCompletion_mod)

[[1]]
<<PlainTextDocument (metadata: 7)>>
1 Below is the first title Innovation and Knowledge Management

[[2]]
<<PlainTextDocument (metadata: 7)>>
2 And now the second Title Organizational Performance and Learning are NA important

[[3]]
<<PlainTextDocument (metadata: 7)>>
3 The third title Knowledge plays an important rule in organizations

Success!
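One caveat is visible in the second document: "veri" came back as NA, because no dictionary word starts with that stem ("very" does not begin with "veri"). If that is unwanted, one option is to keep the original stem whenever no completion is found. The sketch below assumes that behaviour is acceptable; complete_or_keep is a hypothetical helper name, not part of tm, and the inputs are illustrative:

```r
library(tm)

# Hypothetical helper (not part of tm): complete stems, but fall back to
# the stem itself when stemCompletion finds no match and returns NA.
complete_or_keep <- function(stems, dict) {
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  ifelse(is.na(completed), stems, completed)
}

# Illustrative inputs mirroring the "are veri import" case above
dict <- c("are", "very", "important")
stems <- c("are", "veri", "import")

result <- complete_or_keep(stems, dict)
print(unname(result))
```

The same fallback could be placed inside stemCompletion_mod, right after the call to stemCompletion.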
