Parallelizing word2vec Pseudocode on the GPU: Coping with Non-Coalesced Memory Access

Here is the pseudocode (taken from the word2vec C code) that I am trying to parallelize. First I list the data structures with their respective sizes, then the pseudocode itself:

1. long long sen[MAX_SENTENCE_LENGTH]
   // In the C code, MAX_SENTENCE_LENGTH = 1000. Increasing this should be fine.

2. float neu1[N] (hidden layer values)
   // N is the length of each vector. For now, max N = 400.

3. float neu1e[N] (hidden layer error values)

4. float syn0[V * N] (input to hidden layer weight matrix)
   // For now, we can assume that V * N is small enough to be stored on the GPU.
   // In the test data, V = 72k words.

5. float syn1neg[V * N] (back propagation weights used during negative sampling)

6. float exptable[1000]
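
To make the sizes concrete, here is a minimal sketch of how these buffers might be allocated on the device. The d_-prefixed names, the fixed V, and the lack of error checking are assumptions made purely for illustration:

#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_SENTENCE_LENGTH 1000
#define N 400                      /* vector length (max assumed above)  */
#define V 72000                    /* vocabulary size from the test data */

int main(void)
{
    long long *d_sen;
    float *d_syn0, *d_syn1neg, *d_exptable;

    /* sen: one sentence of word indices */
    cudaMalloc((void **)&d_sen, MAX_SENTENCE_LENGTH * sizeof(long long));

    /* syn0 / syn1neg: V x N weight matrices, one row per vocabulary word */
    cudaMalloc((void **)&d_syn0,    (size_t)V * N * sizeof(float));
    cudaMalloc((void **)&d_syn1neg, (size_t)V * N * sizeof(float));

    /* precomputed sigmoid table */
    cudaMalloc((void **)&d_exptable, 1000 * sizeof(float));

    /* neu1 / neu1e are per-word working buffers; on the GPU they would
       live in shared memory or registers rather than global memory.   */

    printf("syn0 + syn1neg take about %zu MB of global memory\n",
           2 * (size_t)V * N * sizeof(float) / (1024 * 1024));

    cudaFree(d_sen); cudaFree(d_syn0); cudaFree(d_syn1neg); cudaFree(d_exptable);
    return 0;
}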

      

The input to the program is a text file. The program processes it one word at a time to build a dictionary. For example, if my text file contains the sentence

"Parallel programming is very very interesting"

then the dictionary will look like this (the code sorts the vocabulary by word frequency):

            {"very":2, "Parallel":1, "programming":1, "is":1, "interesting":1}
                0          1               2            3          4

      

After building the dictionary, the code starts processing the text again, 1000 words at a time. The first 1000 words are stored in sen[MAX_SENTENCE_LENGTH], then the neural network is trained for all words in sen, and this process continues until we reach the end of the file. For the sentence above, sen will look like this: [1, 2, 3, 0, 0, 4].
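
As a sanity check on that mapping, here is a minimal host-side sketch of the lookup step. The tiny fixed vocabulary and the search_vocab helper are made up just for this example; the real word2vec code uses a hash table for the lookup:

#include <stdio.h>
#include <string.h>

/* toy vocabulary, already sorted by frequency as in the example above */
const char *vocab[] = { "very", "Parallel", "programming", "is", "interesting" };
#define VOCAB_SIZE 5

/* hypothetical linear lookup (word2vec itself uses a hash table) */
long long search_vocab(const char *word)
{
    for (long long i = 0; i < VOCAB_SIZE; i++)
        if (strcmp(vocab[i], word) == 0) return i;
    return -1; /* unknown word */
}

int main(void)
{
    const char *sentence[] = { "Parallel", "programming", "is", "very", "very", "interesting" };
    long long sen[6];

    for (int i = 0; i < 6; i++)
        sen[i] = search_vocab(sentence[i]);

    for (int i = 0; i < 6; i++)
        printf("%lld ", sen[i]);       /* prints: 1 2 3 0 0 4 */
    printf("\n");
    return 0;
}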

Assuming training is only done in one iteration, the pseudocode is below:

for sen in text
{
    for word in sen
    {
        for (c = 0; c < N; c++)
            neu1[c] = 0;

        for (c = 0; c < N; c++)
            neu1e[c] = 0;

        /* The variable window is a user-supplied parameter.
           It is used to consider the context around a word in a sentence.
           For example, if I am looking at the first word in the sentence
           (target word is word1) and window = 5, then the words in the
           window = {word2, word3, word4, word5}.
           If I am looking at the third word in the sentence
           (target word is word3), then window = {word1, word2, word4, word5}. */

        for word in window
        {
            for (c = 0; c < N; c++)
                neu1[c] += syn0[c + word * N];
        }

        for (c = 0; c < N; c++)
            neu1[c] /= window;

        // negative: number of negative samples (assume it to be between 5 and 25)
        for (d = 0; d < negative + 1; d++)
        {
            target = sen[random_index];
            l2 = target * N;
            f = 0;
            for (c = 0; c < N; c++)
                f += neu1[c] * syn1neg[c + l2];

            gradient = exptable[function of f];  // f is calculated in the loop above

            for (c = 0; c < N; c++)
                neu1e[c] += gradient * syn1neg[c + l2];

            for (c = 0; c < N; c++)
                syn1neg[c + l2] += gradient * neu1[c];

        } // negative sampling ends

        for word in window
        {
            for (c = 0; c < N; c++)
                syn0[c + word * N] += neu1e[c];
        }

    } // word in sen loop ends

} // sen in text loop ends
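
For reference, the gradient = exptable[function of f] step refers to the precomputed sigmoid table. A sketch of how such a table is typically built and indexed is below; it follows the constants EXP_TABLE_SIZE = 1000 and MAX_EXP = 6 from the original word2vec C code, with the learning-rate and label terms left out:

#include <math.h>

#define EXP_TABLE_SIZE 1000
#define MAX_EXP 6

float exptable[EXP_TABLE_SIZE];

/* Precompute sigmoid(x) for x in (-MAX_EXP, MAX_EXP). */
void build_exptable(void)
{
    for (int i = 0; i < EXP_TABLE_SIZE; i++) {
        float e = expf((i / (float)EXP_TABLE_SIZE * 2.0f - 1.0f) * MAX_EXP);
        exptable[i] = e / (e + 1.0f);            /* sigmoid value */
    }
}

/* "function of f": clamp f to (-MAX_EXP, MAX_EXP) and map it to a table index. */
float sigmoid_lookup(float f)
{
    if (f >=  MAX_EXP) return 1.0f;
    if (f <= -MAX_EXP) return 0.0f;
    int idx = (int)((f + MAX_EXP) * (EXP_TABLE_SIZE / (float)(MAX_EXP * 2)));
    return exptable[idx];
}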

      

I think the best way to parallelize this is to process the words in a sentence in parallel. Considering all the loops, I think I should use N threads per word, so that a single thread accesses global memory (syn0, syn1neg) only once per loop. Also, since all the neu1 and neu1e updates are independent, they can live in the private memory of the threads and be updated independently.
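
To make this concrete, here is a rough sketch of the kernel I have in mind. It is only a sketch under several assumptions: one block per target word with N threads, neu1/neu1e kept in shared memory (so the dot product can be reduced across the block), the context-window indices (ctx, ctx_off, ctx_cnt), the negative-sampling targets/labels and the learning rate alpha precomputed on the host, N a power of two for the reduction, and the cross-block races on syn0/syn1neg ignored in the same lock-free spirit as the CPU code:

#define MAX_EXP 6
#define EXP_TABLE_SIZE 1000

// Hypothetical layout: gridDim.x = number of target words in sen, blockDim.x = N,
// 3*N floats of dynamic shared memory (neu1, neu1e, reduction scratch).
__global__ void train_word_kernel(const long long *ctx, const int *ctx_off,
                                  const int *ctx_cnt, const long long *targets,
                                  const float *labels, int negative, int N,
                                  float alpha, float *syn0, float *syn1neg,
                                  const float *exptable)
{
    extern __shared__ float smem[];
    float *neu1  = smem;          // hidden layer values
    float *neu1e = smem + N;      // hidden layer error
    float *red   = smem + 2 * N;  // scratch for the dot-product reduction

    int w = blockIdx.x;           // which target word of the sentence
    int c = threadIdx.x;          // which vector component (0 .. N-1)

    // 1. average the context vectors: each thread owns one component, so the
    //    syn0 reads of one block are contiguous (coalesced) per context word
    float h = 0.0f;
    for (int j = 0; j < ctx_cnt[w]; j++)
        h += syn0[c + ctx[ctx_off[w] + j] * N];
    neu1[c]  = h / ctx_cnt[w];
    neu1e[c] = 0.0f;
    __syncthreads();

    // 2. negative-sampling loop: one positive/negative target per iteration
    for (int d = 0; d < negative + 1; d++) {
        long long l2 = targets[w * (negative + 1) + d] * N;

        // block-wide reduction for f = dot(neu1, syn1neg row); assumes N is a power of two
        red[c] = neu1[c] * syn1neg[c + l2];
        __syncthreads();
        for (int s = N / 2; s > 0; s >>= 1) {
            if (c < s) red[c] += red[c + s];
            __syncthreads();
        }
        float f = red[0];

        // sigmoid via the precomputed table, as in the pseudocode's exptable lookup
        float sig;
        if      (f >=  MAX_EXP) sig = 1.0f;
        else if (f <= -MAX_EXP) sig = 0.0f;
        else sig = exptable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / (float)(2 * MAX_EXP)))];
        float g = (labels[w * (negative + 1) + d] - sig) * alpha;

        neu1e[c]        += g * syn1neg[c + l2];
        syn1neg[c + l2] += g * neu1[c];
        __syncthreads();
    }

    // 3. push the accumulated error back to every context word's syn0 row;
    //    atomicAdd because other blocks may touch the same rows concurrently
    for (int j = 0; j < ctx_cnt[w]; j++)
        atomicAdd(&syn0[c + ctx[ctx_off[w] + j] * N], neu1e[c]);
}

The launch would then be something like train_word_kernel<<<numWordsInSen, N, 3 * N * sizeof(float)>>>(...), one grid per sentence, with the per-word context lists and negative samples prepared on the host beforehand.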

My main concerns right now are:

  1. Global memory is accessed randomly, since syn0 and syn1neg are indexed by the value (the dictionary index) of the variable word, and as we can see the words in a sentence do not occur in any particular order. Is this a big problem? Or can we hide the memory latency by giving the GPU enough threads? Also, I am not sure whether this access pattern really counts as random, because the N threads for one word will access sequential data in syn0 and syn1neg, while the next set of N threads may access sequential data that is far away in memory (see the access-pattern sketch after this list).

  2. Inside the negative-sampling loop a reduction operation has to be performed. The variable f is a sum of dot products. The problem is that I plan to keep neu1 in the private memory of each thread, whereas syn1neg is in global memory. Will negative sampling need a separate kernel? It looks like it needs a different approach than just launching N threads per word, but I am not sure which approach would work best (see the reduction sketch after this list).

Apart from these issues, please point out any problems with the way I am approaching this code.
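
To illustrate the first concern, here is a minimal sketch of the access pattern I mean; the kernel is made up just for this illustration:

// Each thread of a block reads one component of one syn0 row. Within a block
// the addresses are consecutive (coalesced); between blocks the row base
// word * N jumps around arbitrarily, which is the "random" part I worry about.
__global__ void read_rows(const float *syn0, const long long *sen, int N, float *out)
{
    long long word = sen[blockIdx.x];              // e.g. 1, 2, 3, 0, 0, 4 ...
    int c = threadIdx.x;                           // 0 .. N-1
    out[blockIdx.x * N + c] = syn0[word * N + c];  // coalesced within the warp
}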
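
And for the second concern, one option I am considering is keeping the reduction inside the same kernel with warp shuffles instead of launching a separate kernel. This is only a sketch: it assumes compute capability 3.0+ with CUDA 9's __shfl_down_sync, and that a single warp handles one dot product (with more warps per word, the per-warp results would still have to be combined, e.g. through shared memory):

// Each of the 32 lanes of a warp holds a partial sum of neu1[c] * syn1neg[c + l2];
// the warp then reduces those partials without touching shared or global memory.
__device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up with the full warp sum
}

__device__ float dot_product_warp(const float *neu1, const float *syn1neg_row, int N)
{
    int lane = threadIdx.x & 31;
    float partial = 0.0f;
    for (int c = lane; c < N; c += 32)       // strided so the loads stay coalesced
        partial += neu1[c] * syn1neg_row[c];
    return warp_reduce_sum(partial);         // valid in lane 0
}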



1 answer


Prologue: you have opened a can of worms (even without the SLOC being present), so hopefully you can accept the comments presented part by part, since slicing the elephant seems to be the only way to approach such a complex subject as a whole, rather than slipping away from the principal problems into the comfort zone of individual lines of implementation code, where The Big Picture is usually missed, if it is not already lost a priori.




A1:

Yes, this is the principal fact (aka "problem").

GPU devices were designed and optimised in silicon as SIMD (single instruction, multiple data) hardware architectures, so they work best with code + data compositions that do not exceed (throughout their whole lifecycle) really small areas of memory (kilobytes) which fit into the SIMD SM core's on-chip memory (GPU-SM-REGISTERs without spill-overs, backed at best by an LRU-maintained L1 cache), and thus do not introduce any "idealisation"-destroying latency penalties of about 350-700 ns per gloMEM access.

[B]-Fig.2 states that a TESLA has 8 [SM]-cores per SMX, a pair of [SFU]-s per SM, a multithreaded fetch and issue unit [MTI] per TPC (texture / processor cluster), one 16KB shaMEM bank per SM and a read-only conL1cache per SM (faster / fewer coalescing collisions).

[B]: TECH.GPU: NVIDIA CUDA C Programming Guide, [PG-02829-001_v7.0]; 2015/03

This works great for image processing (convolution-style processing of 2D arrays with small kernels), where all warpSize threads execute the same SIMD instruction at the same (well, of course warpSize-segmented) time on (ideally) non-colliding data cells - this is what suits GPGPU best.

This also means that any real-world operations which do not allow such unambiguously stepped, progressively and fully aligned SIMD ops will naturally block performance (threads can and do wait on inter-thread divergences, on (re-)synchronisation barriers and on remote memory accesses until the data finally arrives, so latency masking hides less and less of these natural obstacles behind the illusion of truly PARALLEL code execution).
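
For illustration only (the kernel and its branch condition are invented here): whenever the lanes of one warp take different paths of a data-dependent branch, the two paths execute serially with the inactive lanes masked off, so the warp pays for both:

// Toy kernel: even lanes take path A, odd lanes path B. Because a warp issues
// one instruction at a time, the two paths are serialised and every lane
// effectively pays the combined cost of both branches.
__global__ void divergent_kernel(const int *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] % 2 == 0)
        out[i] = sinf((float)in[i]);   // path A: taken by the even lanes
    else
        out[i] = cosf((float)in[i]);   // path B: taken by the odd lanes
}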




A2:

No, not much. In-situ benchmarking techniques and the available evidence can quantify the impact and demonstrate the limits of the achievable run-time envelope.

Although there is some help available from the kernel-launch directive __launch_bounds__():

__global__ void
__launch_bounds__( 1,     // dim3tbGridSIZE <maxThreadsPerBLOCK>         COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________REGISTERs, CACHE_, FETCH_, PROXIMITY_PATTERNs ANALYSES
                   1  /*, // dim3tBlockSIZE <minBlocksPerMULTIPROCESSOR> COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________OPTIMUM_SCHEDULE_TO_FILL_FETCH_LATENCIES
                   ?,     // iAsyncSeqOfCmdsQUEUE_Stream_ID <<- TO LET BE FREELY ASSIGNABLE ... NON-BLOCKING EXEC'd KERNEL
                   0  */  // iSharedMemSIZE
                   )
                 Device_printf_GPU_CLK( int const iTag ){
                        ...
                        return;
}

      

There are also many papers published on extensive ("brute-force scanned" and similar) "optimisations", or rather mechanical adjustments, of kernel launch parameterisations (the thread-3D-"geometry"); their influence on the general approach to code design should not be over-estimated, since the results are always kernel-specific (in practice they only explore which 3D-geometry will suffer the least when deployed onto the limited SMX resources within the GPU-off-chip-MEM access hierarchy).

While it is possible to vary the thread-3D-"geometry" of the code execution, the most latency-critical resources (the GPU-SM-REGISTERs) are limited and cannot be "ideally re-used" by other threads during scheduling context swaps. The more threads you schedule, the fewer GPU-SM-REGISTERs remain allocated exclusively per thread (the overall static limit of the respective Compute Capability X.Y is not the problem here), and the more off-chip memory accesses will take place during the actual code execution (no need to write a single SLOC to verify this, just follow the published architecture documentation).
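
To see this trade-off for a concrete kernel, one can ask the runtime how many blocks of a given size fit on a multiprocessor. A minimal sketch using the standard occupancy API follows; the dummy kernel and the candidate block sizes are placeholders only:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dummy_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;
}

int main(void)
{
    // how many resident blocks per SM the register / shared-memory budget allows,
    // for a few candidate block sizes (no dynamic shared memory assumed)
    for (int blockSize = 128; blockSize <= 1024; blockSize *= 2) {
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummy_kernel,
                                                      blockSize, 0);
        printf("blockSize %4d -> %d resident blocks per SM (%d threads)\n",
               blockSize, numBlocks, numBlocks * blockSize);
    }
    return 0;
}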




A3:

No. The idea of splitting the kernel may give the illusion of a potential benefit from a different 3D-geometry arrangement, but your code would run into more trouble paying the additional performance costs of loading / sharing / processing / re-publishing the data structures across the kernel boundary. Kernel splitting makes sense in an all-RDMA-standard code execution.
