Hadoop: having threads inside a map function
Can I have threads inside a map function? I have a problem where threads could really help me. For each line of input, I need to add values to a hashmap concurrently. My input line is split into a string array, and I have to add each value of that array to the hashmap. I later use this hashmap in the cleanup function.
I am doing this with a for loop, and it seems to be the bottleneck of my project. So I thought about using a concurrent hashmap and splitting the string array into several smaller arrays. Each thread would then be responsible for adding its "smaller" array to the hashmap. The thing is, I've implemented this in a local Java application and it works. When I use it inside Hadoop, I don't get the expected results. I am calling Thread.join() on each thread, so for each line of input the threads should finish before the next line is processed. Well, that's what I thought. Does Hadoop handle threads in a special way?
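For reference, the approach described above can be sketched in plain Java, outside Hadoop. This is only a minimal illustration under assumptions: the class name `ChunkedCounter`, the method `addAll`, and the idea that "adding a value" means counting occurrences are all hypothetical, since the question does not show the real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class ChunkedCounter {
    // Shared map (assumption: in a real mapper this would be a field read later in cleanup()).
    static final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    // Splits tokens into at most numThreads chunks, adds each chunk in its own
    // thread, and blocks until every thread has finished.
    static void addAll(String[] tokens, int numThreads) {
        int chunk = (tokens.length + numThreads - 1) / numThreads; // ceiling division
        List<Thread> threads = new ArrayList<>();
        for (int start = 0; start < tokens.length; start += chunk) {
            int end = Math.min(start + chunk, tokens.length);
            final String[] slice = Arrays.copyOfRange(tokens, start, end);
            Thread t = new Thread(() -> {
                for (String token : slice) {
                    counts.merge(token, 1, Integer::sum); // atomic per-key update
                }
            });
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // wait before the caller moves on to the next line
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        addAll("a b a c b a".split(" "), 3);
        System.out.println(counts);
    }
}
```

Note that starting new threads for every input line has its own cost, so a sketch like this only pays off when the per-line work is large.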
Edit for duffymo:
Here is the Google link: http://research.google.com/pubs/pub36296.html
Algorithm 2 is the part I'm talking about. As you can see, there is a for loop over the attributes, and for each attribute I need to update the in-memory structure. In their approach they only need to predict one value (single-label training), whereas in mine I could have many values to predict (multi-label training). So the value y that Google mentions is, for them, a 3-element array. For me, it could be up to a thousand. Aggregating two 3-dimensional vectors is much faster than aggregating two 10,000-dimensional vectors.
If I put only one label in my algorithm, I have no problem at all. The 45 seconds I mentioned drop to less than 5. So yes, it performs well only with a single label.
The 45 seconds I mentioned apply only to the for loop. I didn't count the parsing and everything else. The for loop is the bottleneck because it is the only thing I do and it takes about 45 seconds, while the whole task takes about 1 minute (including task initialization and more). I want to try to split this loop into 2 or 3 smaller for loops and process them at the same time. Trying means that it may work or it may not. Sometimes crazy stuff like what I mentioned can be a must. Okay, here's what a well-respected programmer told me in a previous post about Hadoop.
I did not provide these details before because I thought I only needed an opinion on Hadoop and threads inside the map function. I didn't think anyone would question me that much :P.
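To make the cost argument concrete, here is a rough sketch of the kind of per-attribute update described above. It is not the paper's Algorithm 2; the names `LabelVectorUpdate`, `model`, and `update` are hypothetical, and the structure is assumed to be one accumulated vector per attribute with one slot per label.

```java
import java.util.HashMap;
import java.util.Map;

public class LabelVectorUpdate {
    // attribute -> accumulated vector, one slot per label (hypothetical layout)
    static final Map<String, double[]> model = new HashMap<>();

    // Adds the label vector y into the accumulator of every attribute on the line.
    // Work per input line is (number of attributes) x (number of labels).
    static void update(String[] attributes, double[] y) {
        for (String attribute : attributes) {
            double[] acc = model.computeIfAbsent(attribute, k -> new double[y.length]);
            for (int i = 0; i < y.length; i++) {
                acc[i] += y[i];
            }
        }
    }

    public static void main(String[] args) {
        update(new String[] {"f1", "f2"}, new double[] {1.0, 0.0, 2.0});
        update(new String[] {"f1"}, new double[] {0.5, 1.0, 0.0});
        System.out.println(java.util.Arrays.toString(model.get("f1")));
    }
}
```

With 3 labels the inner loop is negligible; with thousands of labels it dominates, which matches the slowdown described here.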
Hadoop itself is built for parallelism, but it does so in a very coarse-grained way. Hadoop parallelism works well when the dataset is large and can be divided into many subsets that are processed separately and independently (for simplicity, I am talking only about the Map stage here), for example, to find a single pattern in a text.
Now let's consider the following case: we have a lot of data and we want to find 1000 different patterns in it. We now have two options for using our multi-core processors.
1. Process each file with a separate single-threaded mapper, and run multiple mappers per node.
2. Run one mapper per node and process one file with all cores.
The second way can be much more cache-friendly and therefore more efficient.
Bottom line: in cases where fine-grained, multicore-friendly parallelism is warranted by the nature of the processing, using multithreading within the mapper can be beneficial.
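As an illustration of the second option, here is a minimal plain-Java sketch of scanning one piece of text for many patterns with a thread pool. It is not Hadoop code, and the names `MultiPatternScan`, `scan`, and `countMatches` are hypothetical; inside a real mapper, the pool would presumably be created once in setup and shut down in cleanup rather than per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPatternScan {
    // Counts non-overlapping matches of one pattern in the text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Runs one task per pattern on a fixed-size pool; every task reads the same text,
    // so the data is loaded once and shared by all cores.
    static int[] scan(String text, List<Pattern> patterns, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Pattern p : patterns) {
                Callable<Integer> task = () -> countMatches(p, text);
                futures.add(pool.submit(task));
            }
            int[] result = new int[patterns.size()];
            for (int i = 0; i < result.length; i++) {
                result[i] = futures.get(i).get(); // blocks until task i is done
            }
            return result;
        } catch (Exception e) { // InterruptedException or ExecutionException
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Pattern> patterns = new ArrayList<>();
        patterns.add(Pattern.compile("a"));
        patterns.add(Pattern.compile("b+"));
        int[] counts = scan("aabbbab", patterns, 2);
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```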
You won't need threads, if I understand Hadoop and map/reduce correctly.
What makes you think that parsing a single line of input is the bottleneck in your project? Do you just think this is a problem, or do you have the data to prove it?
UPDATE: Thanks for the quote. Obviously something that will need to be digested by me and others, so I won't have any quick advice in the short term. But I really appreciate the quote and your patience.