Batch Gradient Descent for Logistic Regression

I followed Andrew Ng's CS229 machine learning course and am now looking into logistic regression. The goal is to maximize the log-likelihood function and find the optimal theta values for it. Link to the lecture notes: http://cs229.stanford.edu/notes/cs229-notes1.ps (pages 16-19). The code below was shown on the course's main page (in Matlab, though; I converted it to Python).

I am applying it to a dataset with 100 training examples (the dataset given on the Coursera home page for the introductory machine learning course). The data has two features, which are the scores on two exams. The label is 1 if the student was admitted and 0 if the student was not. With the code below, the likelihood function converges to a maximum of around -62. The corresponding theta values are [-0.05560301 0.01081111 0.00088362]. Using these values, when I test a training example like [1, 30.28671077, 43.89499752] that should give 0 as output, I get 0.576, which doesn't make sense to me. If I test the hypothesis function with input [1, 10, 10], I get 0.515, which again doesn't make sense. These values should correspond to a lower probability. This confuses me.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batchlogreg(X, y):
    max_iterations = 800
    alpha = 0.00001

    (m, n) = np.shape(X)

    X = np.insert(X, 0, 1, axis=1)  # prepend the intercept column of ones
    theta = np.zeros(n + 1)
    ll = np.zeros(max_iterations)

    for i in range(max_iterations):
        hx = sigmoid(np.dot(X, theta))
        d = y - hx
        theta = theta + alpha * np.dot(np.transpose(X), d)
        ll[i] = np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))

    return (theta, ll)



3 answers


Note that the sigmoid function satisfies:

sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5
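These properties are easy to sanity-check with a plain NumPy sigmoid (the asker's `sig` module isn't shown, so this definition is an assumption about what it does):

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # exactly 0.5
print(sigmoid(5.0))   # close to 1, always > 0.5
print(sigmoid(-5.0))  # close to 0, always < 0.5
```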

Since you get probabilities that are all higher than 0.5, it means that X * theta is never negative, or when it is, your learning rate is too small for it to matter.

for i in range(max_iterations):
    hx = sigmoid(np.dot(X, theta))  # this will probably be > 0.5 initially
    d = y - hx                      # then this will be "very" negative when y is 0
    theta = theta + alpha * np.dot(np.transpose(X), d)  # (1)
    ll[i] = np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))



The problem is most likely at (1). The dot product will be very negative, but your alpha is so small that it negates the effect. As a result, theta never shrinks enough to correctly classify the examples labeled 0.

Positive instances are then only barely classified correctly for the same reason: with your iteration count and learning rate, the algorithm never reaches a reasonable hypothesis.

Possible solution: increase alpha and/or the number of iterations, or use momentum.



It sounds like you might be confusing probabilities with label assignments.

The probability will be a real number between 0.0 and 1.0. The label will be an integer (0 or 1). Logistic regression is a model that gives the probability that the label is 1 given the input features. To get a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if it is greater than or equal to 0.5.



So, for the examples you gave, the decision would be 1 in both cases (which means the model is wrong for the first example, where the label should be 0).
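In code, that decision rule is just a threshold (the helper name here is mine):

```python
def predict_label(prob, threshold=0.5):
    # Turn a predicted probability into a hard 0/1 label.
    return int(prob >= threshold)

print(predict_label(0.576))  # 1  (the asker's first test example)
print(predict_label(0.515))  # 1  (the [1, 10, 10] test input)
print(predict_label(0.3))    # 0
```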



I ran into the same problem and found the cause.

First, normalize X, or scale the intercept to a comparable magnitude, for example 50.

Otherwise the contours of the cost function are too "narrow": a large alpha makes the updates overshoot, and a small alpha makes no progress.
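A common way to do that normalization is to standardize each feature column before adding the intercept (a sketch; the exam-score values below are made up, and at prediction time you must reuse the same mean and standard deviation on new inputs):

```python
import numpy as np

def normalize_features(X):
    # Standardize each column to zero mean and unit variance.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Example with exam-score-like magnitudes:
X = np.array([[30.3, 43.9],
              [89.7, 72.9],
              [60.2, 86.3]])
X_norm, mu, sigma = normalize_features(X)
```

After this, the intercept column of ones is already on a comparable scale to the features, so a much larger alpha can be used without overshooting.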
