Arabic text mining using R

I am a new R user and would like some help with my work. I am doing Arabic text mining and would appreciate advice from anyone with experience in this area. So far I have struggled to normalize the Arabic text, and R does not even print Arabic characters in the console. I am stuck, and I do not know whether I should switch to another language or tool for the mining. Has anyone had success mining Arabic text with R?
By the way, I am analyzing a dataset of Arabic tweets. It took me a month to collect the data, and I don't know how long preprocessing the text will take.



1 answer


I don't have much experience in this area, but I have no problem with Arabic characters when I try this:

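```{r}
library(tm)
library(tm.plugin.webmining)
library(SnowballC)

# Build a corpus from Google News results for the Arabic query "سلام"
corpus <- WebCorpus(GoogleNewsSource("سلام"))
corpus
inspect(corpus)

# Term-document matrix over the retrieved articles
tdm <- TermDocumentMatrix(corpus)
```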

      

Make sure to install the correct fonts in your OS and IDE.

```{r}
y <- dget("file") # load the file extracted from MongoDB with the rmongodb package
a <- y$tweet_text # keep only the text of the tweets in the dataset
text_df <- data.frame(a, stringsAsFactors = FALSE) # save as a data frame
myCorpus_df <- Corpus(DataframeSource(text_df)) # build a corpus from the data frame
```

      

On OS X, the Arabic characters are displayed correctly:



```{r}
str(myCorpus_df[1:2])
```

List of 2
 $ 1:List of 2
  ..$ content: chr "The CHRONICLE EYE  Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo "
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"


 $ 2:List of 2
  ..$ content: chr "RT @######## جبهة النصرة مهاجرينها وأنصارها  مقراتها مكان آمن لكل من يخشى على نفسه الآذى "
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "2"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
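
Since the question mentions normalization, here is a minimal base-R sketch of one way to clean tweets like the ones above before building the corpus. The function name `normalize_arabic` and the choice of rules (dropping URLs, mentions and a leading "RT", removing diacritics and tatweel, folding alef variants into bare alef) are my own assumptions, not something taken from the dataset or from `tm`. These rules lose information, so adjust them to your task.

```{r}
# Hypothetical helper: a rough normalizer for Arabic tweets (rules are assumptions).
normalize_arabic <- function(x) {
  x <- gsub("https?://\\S+", "", x, perl = TRUE)                 # drop URLs
  x <- gsub("@\\S+", "", x, perl = TRUE)                         # drop @mentions
  x <- gsub("^RT\\s+", "", x, perl = TRUE)                       # drop a leading retweet marker
  x <- gsub("[\u064B-\u0652]", "", x, perl = TRUE)               # remove diacritics (tashkeel)
  x <- gsub("\u0640", "", x, perl = TRUE)                        # remove tatweel (kashida)
  x <- gsub("[\u0622\u0623\u0625]", "\u0627", x, perl = TRUE)    # fold alef variants to bare alef
  x <- gsub("\\s+", " ", x, perl = TRUE)                         # collapse runs of whitespace
  trimws(x)
}

# Made-up sample tweet, written with \u escapes so the script's own encoding does not matter
tweet <- "RT @user \u0623\u064E\u062D\u0645\u0640\u0640\u062F  http://t.co/abc"
clean <- normalize_arabic(tweet)
print(clean)
```

You could apply it to the tweet vector before creating the data frame, for example `a <- normalize_arabic(y$tweet_text)`.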

      

When I check the encoding of an Arabic word on both operating systems (OS X and Windows 7), it is reported as UTF-8:

```{r}
Encoding("لمياه_و_الإصحا")
```

[1] "UTF-8"
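
If the console still shows garbage, it may help to separate "the string is wrong" from "the console cannot display it". The sketch below uses only base R. The sample string and the temporary file are made up for illustration, and I am assuming your data is stored as UTF-8.

```{r}
# Sample word ("سلام"), written with \u escapes so the script's own encoding does not matter
x <- "\u0633\u0644\u0627\u0645"

Encoding(x)                 # encoding mark attached to the string
validUTF8(x)                # TRUE when the bytes form valid UTF-8
nchar(x, type = "chars")    # number of characters
nchar(x, type = "bytes")    # number of bytes (Arabic letters take 2 bytes each in UTF-8)

# Round trip through a file: write the raw UTF-8 bytes, then declare the encoding when reading
tmp <- tempfile(fileext = ".txt")
writeLines(x, tmp, useBytes = TRUE)
back <- readLines(tmp, encoding = "UTF-8")
identical(back, x)

# Whether the current session locale is UTF-8 (often not the case on older Windows setups)
l10n_info()[["UTF-8"]]
```

If the checks on the string pass but printing still fails, the problem is more likely the session locale or the console font than the data itself.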

      

This may also be useful: Reading Arabic data text in R and plot()
