How do I use multiple input features, each with its own extractor, in a Pipeline?

I am working on a classification problem with scikit-learn. I have a dataset in which each observation contains two separate text fields. I want to set up a Pipeline in which each text field is passed in parallel through its own TfidfVectorizer, and the outputs of the two TfidfVectorizer objects are passed to a classifier. My goal is to optimize the parameters of both TfidfVectorizer objects together with those of the classifier using GridSearchCV.

The pipeline can be depicted as follows:

Text 1 -> TfidfVectorizer 1 --------|
                                    +---> Classifier
Text 2 -> TfidfVectorizer 2 --------|

      

I understand how to do this without using Pipeline (just creating TfidfVectorizer objects and working from there), but how do I set this up inside Pipeline?

Thanks for any help,

Rob.

1 answer


Use the Pipeline and FeatureUnion classes. The code for your case will look something like this:

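from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# ExtractText1 and ExtractText2 are not part of scikit-learn; they stand for
# custom transformers that select one text field from each observation.
pipeline = Pipeline([
  # FeatureUnion feeds the same input to both branches and concatenates their outputs.
  ('features', FeatureUnion([
    ('c1', Pipeline([
      ('text1', ExtractText1()),
      ('tf_idf1', TfidfVectorizer())
    ])),
    ('c2', Pipeline([
      ('text2', ExtractText2()),
      ('tf_idf2', TfidfVectorizer())
    ]))
  ])),
  ('classifier', MultinomialNB())
])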
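The two extractor steps have to be written by hand. Here is a minimal sketch of what they might look like, assuming each observation is a (text1, text2) tuple. The class bodies and the sample data are illustrative guesses, not code from the original answer:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ExtractText1(TransformerMixin, BaseEstimator):
    """Hypothetical selector: returns the first text field of each observation."""

    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless.
        return self

    def transform(self, X):
        return [row[0] for row in X]


class ExtractText2(TransformerMixin, BaseEstimator):
    """Hypothetical selector: returns the second text field of each observation."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[1] for row in X]


# Assumed input format: one (text1, text2) tuple per observation.
sample = [("good movie", "would watch again"), ("bad plot", "fell asleep")]
first_texts = ExtractText1().fit_transform(sample)
second_texts = ExtractText2().fit_transform(sample)
```

Each selector hands a plain list of strings to the TfidfVectorizer that follows it in its branch. If your data is a DataFrame or a list of dicts, only the indexing inside transform would need to change.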

      



You can run a grid search over the whole structure by addressing each parameter with the syntax <estimator1>__<estimator2>__<parameter>. For example, features__c1__tf_idf1__min_df refers to the min_df parameter of TfidfVectorizer 1 in your diagram.
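To make the parameter naming concrete, here is a rough end-to-end sketch of a grid search over such a pipeline. FieldSelector is a hypothetical stand-in for ExtractText1 and ExtractText2, and the toy data, the grid values and cv=2 are made up for illustration only:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldSelector(TransformerMixin, BaseEstimator):
    """Hypothetical extractor: picks one field from each (text1, text2) tuple."""

    def __init__(self, index=0):
        self.index = index

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.index] for row in X]


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('c1', Pipeline([('text1', FieldSelector(0)),
                         ('tf_idf1', TfidfVectorizer())])),
        ('c2', Pipeline([('text2', FieldSelector(1)),
                         ('tf_idf2', TfidfVectorizer())])),
    ])),
    ('classifier', MultinomialNB()),
])

# Keys follow the step names from the outermost pipeline inwards.
param_grid = {
    'features__c1__tf_idf1__min_df': [1, 2],
    'features__c2__tf_idf2__ngram_range': [(1, 1), (1, 2)],
    'classifier__alpha': [0.1, 1.0],
}

# Made-up toy data: one (text1, text2) tuple per observation.
X = [
    ("good film", "would watch again"),
    ("good acting", "really enjoyed it"),
    ("good story", "great evening out"),
    ("good cast", "loved every minute"),
    ("bad film", "fell asleep early"),
    ("bad acting", "wanted a refund"),
    ("bad story", "waste of time"),
    ("bad cast", "left before the end"),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
best = search.best_params_  # dict keyed by the same double-underscore names
```

Because both vectorizers and the classifier live inside one estimator, GridSearchCV refits the whole structure for every parameter combination and fold, so the TF-IDF vocabularies are learned from the training split only.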
