Document clustering in Python

I am new to clustering and need some advice on how to approach this problem ...

Let's say I have thousands of sentences; some examples might be:

  • Networking experience
  • STRONG sales experience
  • Strong Networking Skills Preferred
  • REQUIRED Sales Specialist
  • Chocolate apples
  • Jobs are critical to online majors.

In order to group them in the best possible way, what approach could I take?

I have looked at k-means with a bag-of-words vectorization, but with thousands of sentences that can each contain different words, is it efficient to create a vector with one dimension per distinct word and then, for each sentence, record which of those words it contains?

What other approaches are there that I haven't found?

What I have done so far:

  • Imported the sentences from a CSV into a dict of {ID: sentence}
  • Removed stop words from every sentence
  • Counted each remaining word across the corpus to build the main vector, recording how many times each word appears (a short sketch of these steps follows)
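A minimal sketch of those steps, assuming a CSV with columns named id and sentence (the column names, file name, and toy stop-word list are all placeholders, not taken from the question):

    # Load sentences, strip stop words, and count terms across the corpus.
    import csv
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "to", "of", "in", "and", "are", "is"}  # toy list

    sentences = {}
    with open("suggestions.csv", newline="") as f:  # hypothetical file name
        for row in csv.DictReader(f):
            sentences[row["id"]] = row["sentence"]

    vocab = Counter()   # the "main vector": corpus-wide word counts
    tokens_by_id = {}
    for sid, text in sentences.items():
        tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
        tokens_by_id[sid] = tokens
        vocab.update(tokens)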


2 answers


There are two related (but technically distinct) questions here. The first concerns the choice of a clustering method for this data.

The second, prerequisite question concerns the data model: for each sentence in the raw data, how do you convert it into a data vector suitable as input to the clustering algorithm?

Clustering technique

k-means is probably the most popular clustering method, but there are many alternatives. Consider how k-means works: the user selects a small number of data points from the data set (the cluster centers for the initial iteration, aka centroids). Then the distance between each data point and each centroid is computed, and each data point is assigned to its closest centroid; next, new centroids are computed as the mean of the data points assigned to the same cluster. These two steps are repeated until some convergence criterion is reached (for example, the combined centroid movement between two successive iterations falls below a certain threshold).
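A minimal sketch of that loop using scikit-learn's KMeans; the toy 2-D array below stands in for the sentence vectors discussed later:

    # k-means: alternate assignment and centroid-update steps until convergence.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9], [0.9, 2.1]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(X)   # cluster index for each data point
    print(labels)
    print(km.cluster_centers_)   # final centroids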

More powerful clustering techniques do much more than just move cluster centers around. For example, spectral clustering techniques rotate and stretch/compress the data to find a single axis of maximum variance, then determine additional axes orthogonal to the first and to each other, i.e., a transformed feature space. PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), and kPCA (kernel PCA) are members of this class; their defining characteristic is the computation of eigenvalue/eigenvector pairs from the original data or from its covariance matrix. Scikit-learn has a module for PCA computation.
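For instance, a quick PCA sketch with scikit-learn; the random matrix here merely stands in for a sentences-by-terms count matrix:

    # Project the data onto its axes of maximum variance.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 50)           # e.g., 100 sentences x 50 term counts
    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X)      # eigendecomposition of the covariance
    print(pca.explained_variance_ratio_)  # variance captured by each component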

Data model

As you noticed, a common dilemma when building a data model from unstructured text is this: creating a feature for every word in the whole corpus (minus stop words) often leads to very high sparsity across the dataset. Each sentence contains only a small fraction of all the words that appear across all sentences, so each data vector is mostly zeros. On the other hand, if the corpus is pruned so that, for example, only the top 10% most frequent words are used as features, then some or many sentences end up with completely unpopulated data vectors.
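For reference, here is how that sparse term-count matrix looks when built with scikit-learn's CountVectorizer, using the example sentences from the question (get_feature_names_out is available in recent scikit-learn versions):

    # One row per sentence, one column per vocabulary word; mostly zeros at scale.
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "Networking experience",
        "STRONG sales experience",
        "Strong Networking Skills Preferred",
        "REQUIRED Sales Specialist",
    ]
    vec = CountVectorizer(stop_words="english", lowercase=True)
    X = vec.fit_transform(corpus)       # a scipy sparse matrix
    print(vec.get_feature_names_out())  # the vocabulary used as features
    print(X.toarray())                  # dense view; fine for toy data only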

Here is one general sequence of techniques that helps solve this problem, and that can be especially effective given your data: combine related terms into a single term through a processing sequence of normalization, stemming, and synonymization.

This is intuitive. For example:
Normalize: convert all words to lowercase (Python strings have a lower method):

    "REquired".lower()    # returns "required"

Obviously this prevents Required, REquired, and required from becoming three separate features in your data vector, collapsing them into a single term instead.

Stem: after stemming, required, require, and requirement all collapse into a single token, requir.

Two of the most common stemmers are the Porter and Lancaster stemmers (NLTK, described below, has both); a short sketch follows.
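A minimal sketch of both stemmers, assuming NLTK is installed:

    # Porter and Lancaster stemmers from NLTK, applied to related terms.
    from nltk.stem import PorterStemmer, LancasterStemmer

    porter = PorterStemmer()
    lancaster = LancasterStemmer()

    for word in ["required", "require", "requirement"]:
        # Porter typically maps these toward "requir"; Lancaster is more aggressive.
        print(word, porter.stem(word), lancaster.stem(word))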

Synonymization: terms such as smart, capable, and skilled can, depending on the context, be collapsed to a single term by identifying them in a common synonym list.

NLTK, an excellent Python NLP library, has (at least) several good synonym collections, in effect a digital thesaurus, to help you do all three of these steps programmatically.

For example, nltk.corpus.reader.lin is one (just one; there are at least a few other synonym resources in NLTK), and it is easy to use: just import the module and look up the synonyms for a given term.
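As an illustration, here is a synonym lookup using NLTK's WordNet corpus, a different NLTK resource from the lin module above, but the idea is the same (requires a one-time nltk.download("wordnet")):

    # Collect synonyms for a term from WordNet via NLTK.
    from nltk.corpus import wordnet as wn

    def synonyms(term):
        # Union of lemma names across all synsets containing the term.
        return {lemma.name() for syn in wn.synsets(term) for lemma in syn.lemmas()}

    print(synonyms("skilled"))  # map every member of this set to one canonical term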

Several stemmers are available in the NLTK stem package.



In fact, I recently put together a tutorial on document clustering in Python. I would suggest using a combination of k-means and latent Dirichlet allocation (LDA). Take a look, and let me know if I can explain anything further: http://brandonrose.org/clustering
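A rough sketch of that combination using scikit-learn (the linked tutorial may use different libraries; the corpus and parameter values here are placeholders):

    # Hard clusters from TF-IDF + k-means; topic mixtures from LDA.
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = ["Networking experience", "STRONG sales experience",
              "Strong Networking Skills Preferred", "REQUIRED Sales Specialist"]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

    counts = CountVectorizer(stop_words="english").fit_transform(corpus)
    topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

    print(labels)  # one cluster id per sentence
    print(topics)  # per-sentence topic distribution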


