Computational efficiency of a histogram

I am trying to plot a 2 GB matrix using MATLAB's hist on a computer with 4 GB of RAM. The operation takes several hours. Are there ways to improve performance, such as pre-sorting the data, pre-allocating the bins, splitting the data into smaller groups, or discarding raw data as it is added to the bins?

Also, after the data is plotted, I need to adjust the binning to get a smooth curve. This seems to require starting over and re-binning the original data. My guess is that the strategy involving the least computation would be to bin the data with very small bins first and then adjust the bin size of the output, rather than re-binning the original data. What is the best way to adjust the bin size after binning (assuming bins can grow but not shrink)?



1 answer


I don't like answers to StackOverflow questions of the form "okay, although you asked how to do X, you really don't want to do X, you really want to do Y, so here's a solution for Y."

But that's what I'm going to do here. I think such an answer is warranted in this rare case, because the answer below is consistent with sound practice in statistical analysis and because it avoids the problem currently in front of you, which is crunching a 2 GB dataset.

If you want to represent the population distribution using a nonparametric density estimate, and you want to avoid poor computational performance, a kernel density estimator (KDE) does a much better job than a histogram.

To begin with, KDE is clearly preferred over histograms by most academic and practicing statisticians. Among the many texts on this topic, one that I find particularly useful is "Introduction to Kernel Density Estimation".

Reasons why KDE is preferred over histograms

  • the shape of the histogram depends strongly on the choice of the total number of bins; however, there is no authoritative technique for calculating or even estimating a suitable value. (If you doubt this, just plot a histogram from some data and watch its whole shape change as the number of bins changes.)

  • the shape of the histogram depends strongly on where the bin edges are placed.

  • the histogram gives a density estimate that is not smooth.

KDE completely removes the second and third of the histogram problems listed above. Although KDE does not produce a density estimate with discrete bins, an analogous parameter, the bandwidth, must still be supplied.
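The sensitivity to bin count is easy to see for yourself. The following is a minimal sketch, assuming a MATLAB release that has histcounts (R2014b or later) and using made-up bimodal sample data rather than your matrix; the bin counts 5, 20 and 80 are arbitrary choices for illustration.

```matlab
rng(0);                                        % reproducible sample
x = [randn(5000,1); 3 + 0.5*randn(5000,1)];    % made-up bimodal data

binCounts = [5 20 80];                         % arbitrary bin counts to compare
totals = zeros(size(binCounts));
figure;
for k = 1:numel(binCounts)
    % Same data every time; only the number of bins changes.
    [counts, edges] = histcounts(x, binCounts(k));
    centers = edges(1:end-1) + diff(edges)/2;
    subplot(1, numel(binCounts), k);
    bar(centers, counts, 1);                   % width 1 makes the bars touch
    title(sprintf('%d bins', binCounts(k)));
    totals(k) = sum(counts);                   % every point lands in some bin
end
```

With few bins the two modes tend to blur together, and with many bins the outline becomes noisy, even though the underlying data never changes.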

To compute and plot a KDE, you need to pass two parameter values along with your data:

Kernel function: The most common options (all available in the MATLAB kde function) are uniform, triangular, biweight, triweight, Epanechnikov, and normal. Among them, the Gaussian (normal) kernel is the most commonly used.

Bandwidth: The value chosen for the bandwidth will almost certainly have a huge impact on the quality of your KDE. Therefore, sophisticated computing platforms like MATLAB, R, etc. include utility functions (like rusk or a MISE, i.e. mean integrated squared error, function) to estimate the bandwidth given the other parameters.
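To make the roles of the kernel and the bandwidth concrete, here is a hand-rolled Gaussian KDE. It is only a sketch under stated assumptions: the sample data is made up, and the bandwidth comes from the simple normal-reference rule of thumb (1.06 · σ · n^(-1/5)), which is a stand-in chosen for brevity and not necessarily what any particular library function uses.

```matlab
rng(0);
x = randn(2000, 1);                      % made-up sample data
n = numel(x);

% Normal-reference rule-of-thumb bandwidth (an assumption for this sketch).
h = 1.06 * std(x) * n^(-1/5);

% Evaluate the estimate on a grid that extends a little past the data.
xgrid = linspace(min(x) - 3*h, max(x) + 3*h, 401);
fhat = zeros(size(xgrid));
for i = 1:numel(xgrid)
    u = (xgrid(i) - x) / h;              % scaled distance to every data point
    fhat(i) = sum(exp(-0.5 * u.^2)) / (n * h * sqrt(2*pi));  % Gaussian kernel
end

plot(xgrid, fhat);                       % one smooth curve
area = trapz(xgrid, fhat);               % should be close to 1 for a density
```

Increasing h smooths the curve and decreasing h makes it follow the data more closely, which is the KDE counterpart of changing the bin width.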




KDE in MATLAB

kde.m is a function in MATLAB that implements KDE:

% Output names suggest: h = bandwidth, fhat = density estimate on xgrid
[h, fhat, xgrid] = kde(x, 401);

      

Note that in the call to kde.m, neither the bandwidth nor the kernel is passed in. For the bandwidth, kde.m wraps a bandwidth selection function; for the kernel, it uses a Gaussian.
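If kde.m is not on your path, one possible alternative is ksdensity, assuming the Statistics and Machine Learning Toolbox is installed. The sketch below uses made-up data and shows both the automatic bandwidth and an explicit kernel and bandwidth; doubling the bandwidth is an arbitrary choice for illustration.

```matlab
rng(0);
x = randn(2000, 1);                       % made-up sample data

% Default: normal kernel, bandwidth chosen automatically and returned in bw.
[f, xi, bw] = ksdensity(x, 'NumPoints', 401);

% Explicit kernel and bandwidth (2*bw is arbitrary, just to show the effect).
[f2, xi2] = ksdensity(x, 'Kernel', 'epanechnikov', ...
                      'Bandwidth', 2*bw, 'NumPoints', 401);

plot(xi, f, xi2, f2);
legend('automatic bandwidth', 'doubled bandwidth, Epanechnikov');
```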


But would using KDE instead of a histogram solve, or at least substantially reduce, the very slow performance you see with your 2 GB dataset?

Sure.

In your question, you stated that there was a lag during plotting. KDE doesn't require thousands (millions?) of data points to be drawn individually, each with a symbol, a color, and a specific location on the canvas; instead, it draws a single smooth line. And since the entire dataset doesn't need to be drawn one point at a time on the canvas, it doesn't need to be stored (in memory!) while the plot is being built and displayed.
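The same idea of not holding everything in memory can be applied when reducing the raw data. The following sketch, which is an illustration and not part of kde.m, accumulates fine-bin counts one chunk at a time and then coarsens them by summing adjacent bins, so the raw data is read only once. The chunk source (randn here), the range [-6, 6], the chunk size and the merge factor are all assumptions for the example; histcounts requires R2014b or later.

```matlab
rng(0);
edges = linspace(-6, 6, 1201);            % 1200 fine bins; range is assumed
counts = zeros(1, numel(edges) - 1);

nChunks = 20;                             % hypothetical number of blocks
chunkSize = 1e5;                          % hypothetical block size
for c = 1:nChunks
    chunk = randn(chunkSize, 1);          % stand-in for reading one block from disk
    counts = counts + histcounts(chunk, edges);   % only the counts are kept
end

% Grow the bins afterwards: merge every k adjacent fine bins into one.
k = 10;                                   % must divide the number of fine bins
coarseCounts = sum(reshape(counts, k, []), 1);
coarseEdges = edges(1:k:end);

centers = coarseEdges(1:end-1) + diff(coarseEdges)/2;
bar(centers, coarseCounts, 1);
```

Because merging only sums existing counts, bins can be made wider this way but not narrower, so the initial bins should be at least as fine as the finest resolution you expect to need.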
