How to use R's neuralnet package for the Titanic Kaggle competition

I am trying to run this code for Kaggle's Titanic competition as an exercise. It's a lighthearted beginner case. I am using the neuralnet package in R.

First I read in the training data from the website:

train <- read.csv("train.csv")
m <- model.matrix(~ Survived + Pclass + Sex + Age + SibSp, data = train)
head(m)
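(For reference: model.matrix expands the Sex factor into a 0/1 dummy column named Sexmale, which is why that name appears in the neuralnet formula below. A minimal sketch with made-up rows:)

```r
# Made-up rows to show how model.matrix encodes the columns:
# the Sex factor becomes a 0/1 dummy column named "Sexmale"
toy <- data.frame(Survived = c(0, 1), Pclass = c(3, 1),
                  Sex = factor(c("male", "female")),
                  Age = c(22, 38), SibSp = c(1, 1))
m_toy <- model.matrix(~ Survived + Pclass + Sex + Age + SibSp, data = toy)
colnames(m_toy)
# "(Intercept)" "Survived" "Pclass" "Sexmale" "Age" "SibSp"
```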


Here I train the neural network on who survived, to see if I can predict survival:

library(neuralnet)

r <- neuralnet(Survived ~ Pclass + Sexmale + Age + SibSp,
               data = m, hidden = 10, threshold = 0.01, rep = 100)


The network trains. Then I download the test data and prepare it for prediction:

test <- read.csv("test.csv")

m2 <- model.matrix(~ Pclass + Sex + Age + SibSp, data = test)


Finally, the prediction on the test data:

res <- compute(r, m2)


First, I don't know how many hidden neurons to use; sometimes training takes a long time. And when training does succeed, I cannot run the prediction on the test data, because I get an error saying the two data sets are incompatible:

res <- compute(r, m2)

Error in neurons[[i]] %*% weights[[i]] : non-conformable arguments


What am I doing wrong here?

All code:

train <- read.csv("train.csv")
m <- model.matrix(~ Survived + Pclass + Sex + Age + SibSp, data = train)
head(m)

library(neuralnet)

r <- neuralnet(Survived ~ Pclass + Sexmale + Age + SibSp,
               data = m, hidden = 10, threshold = 0.01, rep = 100)

test <- read.csv("test.csv")

m2 <- model.matrix(~ Pclass + Sex + Age + SibSp, data = test)

res <- compute(r, m2)




1 answer


Try using this to predict instead:

res <- compute(r, m2[, c("Pclass", "Sexmale", "Age", "SibSp")])


This worked for me and you should get some output.
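A side note on that output: compute() returns continuous activations in res$net.result, so to get 0/1 survival labels you can round them. A sketch with made-up activations standing in for the real net.result:

```r
# Made-up activations standing in for res$net.result
net_result <- matrix(c(0.12, 0.87, 0.55), ncol = 1)
pred <- round(net_result)  # effectively thresholds at 0.5
as.vector(pred)            # 0 1 1
```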

What seems to have happened: model.matrix creates an additional column, (Intercept), that was not part of the data used to build the neural network, so compute does not know what to do with it. neuralnet tries to do a matrix multiplication, but the matrix has the wrong dimensions, hence the non-conformable arguments error. The solution is to explicitly select the columns the network was trained on when calling compute.
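The mismatch is easy to see with made-up rows standing in for test.csv: the model matrix carries an (Intercept) column that the trained network has no weights for.

```r
# Made-up rows standing in for test.csv
toy_test <- data.frame(Pclass = c(1, 3), Sex = factor(c("male", "female")),
                       Age = c(30, 22), SibSp = c(0, 1))
m2_toy <- model.matrix(~ Pclass + Sex + Age + SibSp, data = toy_test)
ncol(m2_toy)   # 5 columns, including "(Intercept)"
# the network was trained on 4 covariates, so the matrix multiplication
# inside compute() cannot conform; drop the extra column:
m2_cov <- m2_toy[, c("Pclass", "Sexmale", "Age", "SibSp")]
ncol(m2_cov)   # 4
```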


As for how many hidden neurons to use, and other hyperparameters: you can tune them with cross-validation and similar methods. If you use a different package (nnet), the caret package can determine the optimal parameters for you. It looks like this:

library(caret)
nnet.model <- train(Survived ~ Pclass + Sex + Age + SibSp, 
                    data=train, method="nnet")
plot(nnet.model)
res2 <- predict(nnet.model, newdata = test)




The hyperparameter plot from plot(nnet.model) looks like this:

[image: hyperparameter tuning plot]


You can measure performance with confusionMatrix from the caret package:

library(neuralnet)
library(caret)
library(dplyr)

train <- read.csv("train.csv")
m <- model.matrix(~ Survived + Pclass + Sex + Age + SibSp, data = train)

r <- neuralnet(Survived ~ Pclass + Sexmale + Age + SibSp,
               data = m, rep = 20)

res <- neuralnet::compute(r, m[, c("Pclass", "Sexmale", "Age", "SibSp")])
pred_train <- round(res$net.result)

# keep only the rows that actually received a prediction; rows with
# missing values (e.g. Age) are dropped by model.matrix
pred_rowid <- as.numeric(row.names(pred_train))
train_survived <- train %>%
  filter(row_number() %in% pred_rowid) %>%
  select(Survived)
confusionMatrix(as.factor(train_survived$Survived), as.factor(pred_train))


Output:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 308 128
         1 164 114

               Accuracy : 0.5910364             
                 95% CI : (0.5539594, 0.6273581)
    No Information Rate : 0.6610644             
    P-Value [Acc > NIR] : 0.99995895            

                  Kappa : 0.119293              
 Mcnemar's Test P-Value : 0.04053844            

            Sensitivity : 0.6525424             
            Specificity : 0.4710744             
         Pos Pred Value : 0.7064220             
         Neg Pred Value : 0.4100719             
             Prevalence : 0.6610644             
         Detection Rate : 0.4313725             
   Detection Prevalence : 0.6106443             
      Balanced Accuracy : 0.5618084             

       'Positive' Class : 0    
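The reported accuracy can be checked by hand from the confusion matrix above: the correct predictions sit on the diagonal.

```r
# Rebuild the confusion matrix above and verify the reported accuracy
cm <- matrix(c(308, 164, 128, 114), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference = c("0", "1")))
accuracy <- sum(diag(cm)) / sum(cm)  # (308 + 114) / 714
round(accuracy, 7)  # 0.5910364, matching the output above
```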








