Posted to common-user@hadoop.apache.org by Andrew Nguyen <an...@ucsfcti.org> on 2010/04/16 21:01:33 UTC

Splitting input for mapper and contiguous data

As I may have mentioned, my main goal at the moment is processing physiologic data with Hadoop and MapReduce.  The steps are:

1. Convert ADC units to physical units (input is <sample num, raw value>, output is <sample num, physical value>); a rough sketch of this map-only job follows the list.
2. Perform peak detection to find the systolic blood pressure (input is <sample num, physical value>, output is <sample num, physical value>, but the output is only a subset of the input).
3. Calculate the central tendency measure over a sliding window (mapper input is <sample num, physical value>, mapper output is <window ID, (sample num, physical value)>, reducer output is <window ID, central tendency measures at different radii>).
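
For reference, step 1 is a map-only job shaped roughly like the sketch below.  This is simplified; the class name, the tab-separated field layout, and the linear calibration constants are just placeholders, not my real code.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only step 1: convert raw ADC counts to physical units.
public class AdcToPhysicalMapper
    extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

  // Placeholder linear calibration; the real gain/offset come from the device.
  private static final double GAIN = 0.0125;
  private static final double OFFSET = -40.0;

  @Override
  protected void map(LongWritable fileOffset, Text line, Context context)
      throws IOException, InterruptedException {
    // Input lines look like "<sample num>\t<raw ADC value>".
    String[] fields = line.toString().split("\t");
    long sampleNum = Long.parseLong(fields[0]);
    double raw = Double.parseDouble(fields[1]);
    double physical = raw * GAIN + OFFSET;
    context.write(new LongWritable(sampleNum), new DoubleWritable(physical));
  }
}

The job is run with setNumReduceTasks(0), so the map output is written straight to the output files.  Step 2 has the same map-only shape, except the mapper only emits the samples it identifies as peaks.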

Each of the above steps builds on the result of the previous one.  For the first two steps, I have been doing everything in the mapper and specifying 0 reduce tasks.  For the last step, I perform the calculations on a sliding window of N points, skipping forward M points to the next window, where N >> M.  To implement this, I have a mapper that outputs all of the (x, y) points (the value) for a particular key (the window ID), and the reducer then performs the calculations on each window's data.  Everything works pretty well, except that I noticed that the way the input is split across mappers affects the final output.  Due to the nature of the calculations, this doesn't affect the end result very much.
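
To make that concrete, the windowing mapper looks roughly like the sketch below.  Again this is simplified and not my exact code; N, M, and the field parsing are placeholders.  The important part is the last one: window IDs are anchored at the first sample number that each mapper instance happens to see.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 3 mapper: assign each (sample num, physical value) point to every
// sliding window that contains it, keyed by window ID.
public class WindowMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {

  private static final long N = 4096;  // window length, placeholder
  private static final long M = 256;   // window step, placeholder; N >> M

  // First sample number this mapper sees; window IDs are anchored here,
  // which is why they shift when the input is split differently.
  private long firstSample = -1;

  @Override
  protected void map(LongWritable fileOffset, Text line, Context context)
      throws IOException, InterruptedException {
    // Input lines look like "<sample num>\t<physical value>".
    long sampleNum = Long.parseLong(line.toString().split("\t")[0]);
    if (firstSample < 0) {
      firstSample = sampleNum;
    }
    long offset = sampleNum - firstSample;
    // Emit this point under the ID (starting sample num) of every window
    // that covers it: window starts are multiples of M, window length is N.
    long latestStart = (offset / M) * M;
    for (long start = latestStart; start >= 0 && offset - start < N; start -= M) {
      context.write(new LongWritable(firstSample + start), line);
    }
  }
}

The reducer then just iterates over all of the (sample num, physical value) pairs for a given window ID and computes the central tendency measure at each radius.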

However, I'm trying to make sure I understand everything properly, and I want to see if there is a better/proper way of implementing something like this.  I'm guessing the problem comes from the fact that I'm trying to build windows out of N contiguous data points: the window ID is just the first sample num encountered for the window, so the first sample num encountered (and therefore where each window starts) changes for every map task except the first, compared to a serial execution.

Thanks!

--Andrew