K-NN on nonlinear data + Dimension reduction

I am trying to use k-NN on a complex simulated dataset: a numpy array of shape (1000, 100), so 100 dimensions. Before running k-NN for training/classification, I need to preprocess/transform the dataset. PCA doesn't work because the variance of all features is nearly the same. The data is available as a CSV in this gist: https://gist.github.com/modqhx/0ab61da16eae8f371a1d6a787f018a64

When plotted, the data looks like a 3D spherical structure (here a screenshot using "hypertools"):

[screenshot: 3D spherical point cloud rendered with hypertools]

Any thoughts on how to proceed?

EDIT, in response to a comment: Yes, I understand that if there is no "visible" clustering, there is little point in using k-NN; let me formulate it more precisely. The raw data does not show clusters, but some form of dimensionality reduction may reveal them. There are 100 features, and PCA does not help because the variance of all 100 features is nearly the same. The question becomes: how can we reduce the dimension when the variance of all features is almost the same? ... Again, this is an exercise, and the point is to make k-NN work (if that makes sense). I was told that you will not find any clusters in the first and second moments, but you can in the higher moments (third and beyond).
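For what it's worth, here is a minimal sketch of how one might look past the first two moments, computing per-feature skewness and kurtosis with scipy. The file name and the assumption that the trailing column is a label (not a feature) are mine, not established facts about the gist:

import numpy as np
from scipy.stats import skew, kurtosis

dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')  # path is an assumption
X = dat[:, :-1]  # assumption: trailing column is a label, not a feature

# Variances are nearly identical across features (which is why PCA fails),
# but the higher moments may still differ from feature to feature.
print('variance range:', X.var(axis=0).min(), X.var(axis=0).max())
print('skewness range:', skew(X, axis=0).min(), skew(X, axis=0).max())
print('kurtosis range:', kurtosis(X, axis=0).min(), kurtosis(X, axis=0).max())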

1 answer


I basically agree with @javadba's comment: if your dataset doesn't have an obvious clustering property when you look at it, applying k-NN or any other clustering algorithm will only give you artifacts and dubious signals. The reason I am writing this answer is that I found some kind of structure in your data.

What I did was download your data first. As I understand it, your (1000, 101)-shaped array corresponds to 1000 points in 100-dimensional space (plus a trailing column of zeros/ones, which probably doesn't matter right now). Note that this is a very sparse object if you think about it. Consider a line with 2 points, a square with 4 points, a cube with 8 points... a 100-dimensional regular grid with the same resolution (2 points along each dimension) would contain 2^100 points. That's... much more than 1000. Unfortunately, I find it difficult to visualize sparse point clouds in 100-dimensional space.
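To put that comparison in numbers, a trivial back-of-the-envelope check:

# a 2-points-per-axis grid in 100 dimensions vs. the 1000 points we have
print(2 ** 100)          # 1267650600228229401496703205376
print(2 ** 100 / 1000)   # ~1.27e27 grid cells per available point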

So I chose three axes at random and plotted the corresponding 3D scatter plot to see if there were any patterns, and I repeated this several times. Code:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection

# load the data; the trailing column is excluded from the axis pool
dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')
nums = np.arange(dat.shape[1] - 1)
for _ in range(5):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    np.random.shuffle(nums)
    inds = nums[:3]          # pick three random dimensions
    plotdat = dat[:, inds]
    ax.scatter(*plotdat.T)
    ax.set_title(str(inds))  # record which dimensions were plotted
plt.show()

Here's what a typical plot looks like:

[figure: typical 3D scatter of three random dimensions, no visible structure]

As you can see, this is a mess! Or, more scientifically, it is hard to visually distinguish these scatter plots from samples of a uniform distribution on a cube. If all the plots look like this, then it is possible that there is no clustering to begin with, and you should stop right there: whatever labels you might assign to your data would be meaningless.

But there is good news: in the interactive window I noticed that the above plot looks much more interesting from a certain angle:

[figure: the same scatter plot rotated so that the points split cleanly along dimension 42]

The data clearly separates along dimension 42 (of all numbers!). Now, this is promising. Note that the dataset might be genuinely clustered even when this is not obvious from any axis-aligned projection. Imagine the following example scenario in 2d:

[figure: two non-overlapping diagonal blobs in 2d]

Even though the data is clearly clustered, this is far from obvious if we only look at axis-aligned projections.
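To make this concrete, here is a small self-contained sketch (synthetic data of my own invention, not your dataset) of two well-separated diagonal blobs whose projection onto the x axis overlaps almost completely:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# two elongated blobs running along the diagonal y = x, offset from each other
t1 = rng.uniform(-3, 3, 500)
t2 = rng.uniform(-3, 3, 500)
blob1 = np.column_stack([t1, t1 + 1.5]) + rng.normal(0, 0.3, (500, 2))
blob2 = np.column_stack([t2, t2 - 1.5]) + rng.normal(0, 0.3, (500, 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(*blob1.T, s=5)
ax1.scatter(*blob2.T, s=5)
ax1.set_title('clearly clustered in 2d')
# the axis-aligned (x) projections of the two clusters overlap heavily
ax2.hist(blob1[:, 0], bins=30, alpha=0.5)
ax2.hist(blob2[:, 0], bins=30, alpha=0.5)
ax2.set_title('x projection: clusters invisible')
plt.show()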

My point is that finding evidence for the existence of clusters in your 100-dimensional dataset is really hard. It is already difficult to find evidence of clustering in low-dimensional subspaces, and even if you cannot find such evidence, that does not mean your data is not clustered along some diagonal configuration in 100d space.

I would start by looking at lower-dimensional slices in this way. The fact that your points split so nicely along dimension 42 suggests that the situation is not hopeless. You could try to systematically check each dimension to see whether any other such separation shows up (a sketch follows below)... but keep in mind that, even for a dimension like number 42, such a separation may only become visible in combination with certain other dimensions.
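One way to systematize that check is to sort each column and flag unusually large gaps between consecutive values. This is a rough heuristic of my own, and the cutoff factor is arbitrary:

import numpy as np

dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')
X = dat[:, :-1]

for d in range(X.shape[1]):
    col = np.sort(X[:, d])
    gaps = np.diff(col)
    # flag dimensions whose largest internal gap dwarfs the typical gap;
    # the factor 10 is an arbitrary threshold -- tune it for your data
    if gaps.max() > 10 * np.median(gaps):
        print(f'dimension {d}: gap {gaps.max():.3f} '
              f'starting at value {col[gaps.argmax()]:.3f}')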

If dimension 42 is special in that it cleanly separates most of your points, you could try splitting your data along that axis and working with the two halves as separate datasets in the remaining 99-dimensional space.
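If it does, a minimal sketch of the split, followed by a k-NN fit on one half, might look like the following. Note that the threshold value and the use of the trailing 0/1 column as a class label are both assumptions on my part, not facts established from your data:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

dat = np.loadtxt('foodat.csv', skiprows=1, delimiter=',')
X, y = dat[:, :-1], dat[:, -1]   # assumption: trailing column is a label

# assumption: dimension 42 splits the cloud around some threshold;
# 0.5 is a placeholder -- read the actual value off your own plot
mask = X[:, 42] > 0.5

# drop dimension 42 and work with the two halves in 99-dimensional space
X_upper = np.delete(X[mask], 42, axis=1)
X_lower = np.delete(X[~mask], 42, axis=1)
print(X_upper.shape, X_lower.shape)

knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X_upper, y[mask], cv=5).mean())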
