Machine learning: predicting a text field based on other text fields

I've been working on machine learning and prediction for about a month now. I've tried IBM Watson with Bluemix, Amazon Machine Learning, and PredictionIO. What I want to do is predict a text field based on other fields. My CSV file has four text fields, named Question, Summary, Description, and Answer, and about 4500 lines/records. The loaded dataset has no numeric fields. A typical entry looks like the one below.

{'Question':'sys down','Summary':'does not boot after OS update','Description':'Desktop does not boot','Answer':'Switch to safemode and rollback last update'}

      

On IBM Watson, I found a question on their forums, and the answer was that loading a custom corpus is not possible right now. Then I switched to Amazon Machine Learning. I followed their documentation and was able to implement prediction in a custom application using the API. I tested with the MovieLens data, where everything is numeric. I successfully uploaded the data and got movie recommendations using the python-boto library. When I tried to upload my own CSV file, the problem was that no text field could be selected as the target. Then I added numerical values corresponding to each value in the CSV. This made the prediction run, but the accuracy was poor. Maybe the CSV needs to be formatted better.
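The question does not say exactly how the numerical values were assigned. One common reading is a label encoding, where each distinct text value gets an integer ID. A minimal sketch under that assumption (the function name and the second sample answer are hypothetical, chosen only for illustration):

```python
def build_label_index(values):
    """Assign each distinct string an integer ID, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


# Sample Answer column; the second answer is made up for illustration.
answers = [
    "Switch to safemode and rollback last update",
    "Reinstall the printer driver",
    "Switch to safemode and rollback last update",
]

label_index = build_label_index(answers)
id_to_answer = {i: text for text, i in label_index.items()}

# Encoded target column: repeated answers share one ID.
encoded = [label_index[a] for a in answers]
print(encoded)
print(id_to_answer[encoded[0]])
```

IDs produced this way are category labels, not quantities: ID 1 is not "more" than ID 0. If a service treats such a column as a numeric target, that could be one reason for poor accuracy; treating it as a categorical (multiclass) target is usually the safer choice.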

Below is an entry from the MovieLens data. It says that userID 196 gave movieID 242 a three-star rating at time (Unix timestamp) 881250949.

196 242 3   881250949
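For reference, a line like this can be split into named fields as sketched below. The field names and the `parse_rating` helper are assumptions made for readability; they are not part of the dataset or of any library.

```python
from datetime import datetime, timezone


def parse_rating(line):
    """Split one whitespace-separated rating line into named fields."""
    user_id, movie_id, rating, timestamp = line.split()
    return {
        "user_id": int(user_id),
        "movie_id": int(movie_id),
        "rating": int(rating),
        # Interpret the Unix timestamp as a UTC datetime.
        "rated_at": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }


record = parse_rating("196 242 3   881250949")
print(record)
```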

      

I am currently trying PredictionIO. The MovieLens test ran without issue, as described in the documentation, using the recommendation template. However, it is still not clear to me whether it is possible to predict a text field based on other text fields.

Does prediction work for numeric fields only, or can a text field be predicted based on other text fields?



1 answer


No, prediction does not work only on numeric fields. It can work on anything, including text. I assume the MovieLens data uses IDs instead of the actual user and movie names because

  • it saves storage space (this dataset has been around for a long time, and back then storage was definitely a concern), and

  • there is no need to know the actual user names (a privacy issue).



In your case, you can look at the text classification template https://docs.prediction.io/demo/textclassification/ . You will need to determine how you want to classify each entry.
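To make the idea concrete, here is a minimal sketch of one way to classify entries. It is not the PredictionIO template and does not use its API. It is a plain-Python stand-in that treats each known Answer as a class and picks the training entry whose combined Question, Summary, and Description text is most similar (bag-of-words cosine similarity) to the new entry. The second training row and the query are invented for illustration.

```python
import math
import re
from collections import Counter

INPUT_FIELDS = ("Question", "Summary", "Description")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(entry):
    """Build a bag-of-words count vector from the input text fields."""
    text = " ".join(entry[field] for field in INPUT_FIELDS)
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_answer(training, query):
    """Return the Answer of the training entry most similar to the query."""
    query_vec = vectorize(query)
    best = max(training, key=lambda e: cosine(vectorize(e), query_vec))
    return best["Answer"]


training = [
    {"Question": "sys down", "Summary": "does not boot after OS update",
     "Description": "Desktop does not boot",
     "Answer": "Switch to safemode and rollback last update"},
    # Hypothetical second row, added only so there is more than one class.
    {"Question": "printer offline", "Summary": "cannot print from laptop",
     "Description": "Printer shows offline after network change",
     "Answer": "Reinstall the printer driver"},
]

query = {"Question": "pc down", "Summary": "fails to boot after update",
         "Description": "Desktop will not boot"}
predicted = predict_answer(training, query)
print(predicted)
```

With about 4500 records, a trained classifier (such as the one in the template) would likely generalize better than this nearest-match lookup, but the framing is the same: the text fields are the input and the Answer is the class label.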
