Posted to common-user@hadoop.apache.org by Andy Doddington <an...@doddington.net> on 2011/11/10 11:55:25 UTC

Mappers and Reducer not being called, but no errors indicated

Hi,

I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.

1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20MB per task.
2) Each of my map tasks reads the input and generates a pair of floating point values.
3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.

Unfortunately, this is not working, but is exhibiting the following symptoms:

- Based on log output, I have no evidence that the mappers are actually being called, although the ‘percentage complete’ output seems to go down slowly as might be expected if they were being called.
- I only ever get a single part-00000 file created, regardless of how many maps I specify.
- In the case of my reducer, although its constructor, ‘setConf’ and ‘close’ methods are called (based on log output), its reduce method never gets called.

I have checked the visibility of all classes and confirmed that the method signatures are correct (verified by Eclipse and use of the @Override annotation), and I’m at my wit’s end. To further add to my suffering, the log outputs do not show any errors :-(

I am using the Cloudera CDH3u1 distribution.

As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).

Thanks in anticipation,

	Andy Doddington


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Andy Doddington <an...@doddington.net>.
OK, continuing our earlier conversation...

I have a job that schedules 100 map tasks (a small number, just for testing), passing data via a set of 100 sequence files. This is based on the PiEstimator example that is shipped with the distribution.

The data consist of a blob of serialised state, amounting to around 20MB per task. I have added various checks, including checksums,
to reduce the risk of data corruption or misalignment.

The mapper takes the blob of data as its value input and an integer in the range 0-99 as its key (passed as a LongWritable).

Each mapper then does some processing, based upon the deserialised contents of the blob and the integer key value (0-99).

The reducer then selects the minimum value that was produced across all of the mappers.
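
A minimal sketch of what this job wiring might look like with the old org.apache.hadoop.mapred API that PiEstimator is built on. MinMapper and MinReducer are assumed stand-ins for the poster's own classes (sketches with those names appear further down in this thread), and the paths are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class MinDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MinDriver.class);
    conf.setJobName("min-over-blobs");

    // One small SequenceFile per map task lives in the input directory.
    conf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("min/in"));
    FileOutputFormat.setOutputPath(conf, new Path("min/out"));

    conf.setMapperClass(MinMapper.class);    // assumed user classes
    conf.setReducerClass(MinReducer.class);
    conf.setNumReduceTasks(1);               // a single global minimum
    // (the combiner is deliberately left unset here)

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(DoubleWritable.class);

    JobClient.runJob(conf);
  }
}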

Unfortunately, this process is generating an incorrect value, when compared to a simple iterative solution.

After inspecting the results it seems that the mappers are generating correct values for even-numbered keys, but incorrect
values for odd-numbered keys. I am logging the values of the keys, so I am confident that these are correct. My serialisation
checks also make me confident that the ‘value’ blobs are not getting corrupted, so it’s all something of a mystery.

Harsh J: Previously, you indicated that this might be a “...key/val data issue… ...Perhaps bad partitioning/grouping is happening as a result of that”. I apologise for the lack of detail, but do you think this still might be the case? If so, could you refer me to some place that gives more detail on this type of issue?

With apologies for continuing to be a nuisance :-(

Andy D


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Harsh J <ha...@cloudera.com>.
Hey Andy,

Inline.

On 10-Nov-2011, at 10:03 PM, Andy Doddington wrote:

> Thanks for your kind words - it still feels like pulling teeth at times :-(
> 
> Following on from your comments, here are a few more questions - hope you don’t find them too dumb…
> 
> 1) How does each mapper ‘know’ which file name to associate itself with?

The client creates a special file on HDFS while submitting a job to the JT that, dumbly speaking, carries an array of filenames (along with offset and length info for splits -- this is called a 'FileSplit'). This is used by the JobTracker to determine the # of tasks and such, and is then later used by each scheduled mapper to look up its own index (map [0] gets file a, map [1] gets file b, etc.) and initialize its appropriate reads. Again, this is a very dumb explanation -- the truth is slightly more complicated but this is how the mechanism works (pull, not push).
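
To make the pull mechanism concrete, here is a minimal sketch (assuming the old org.apache.hadoop.mapred API) of the PiEstimator-style driver step: write one small SequenceFile per map into the job's input directory, so the input format later produces one FileSplit per file and each map task reads exactly one of them. The method and parameter names are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class PerMapInputWriter {
  // inDir, numMaps and blob are illustrative parameters, not Hadoop names.
  public static void writePerMapInputs(Configuration conf, Path inDir,
                                       int numMaps, byte[] blob) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    for (int i = 0; i < numMaps; i++) {
      // Any unique file name works; "part" + i just mirrors PiEstimator.
      Path file = new Path(inDir, "part" + i);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, file, LongWritable.class, BytesWritable.class);
      try {
        // Key = map index, value = the serialised state blob for that map.
        // Files well under one HDFS block each become exactly one FileSplit.
        writer.append(new LongWritable(i), new BytesWritable(blob));
      } finally {
        writer.close();  // forgetting to close was one of the issues in this thread
      }
    }
  }
}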

> 2) Is it important that I name my files part<n> or will any unique name suffice?

HDFS is like any other filesystem. Filenames do not matter.

The "part" is short for "partition", and is used for output files by the default APIs to indicate that each part-XXXXX file is a partition of the whole output. It is just a terminology used by MR, by default (again, like everything, the default output name is configurable as well).

Naming with numbers gets you free sorting though, when you list out files of a directory.

> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?

Record splits will never happen. This is guaranteed. See the second para of the 'Map' section at http://wiki.apache.org/hadoop/HadoopMapReduce to understand how this is ensured.

For sequence files, instead of 'newlines', there are 'magic' byte markers that serve the same purpose (aligning record readers to start from a proper point, one that is not inside a record). These markers are placed at regular intervals in your sequence file already.
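
A small sketch of the same alignment trick a record reader uses, assuming an existing file keyed by LongWritable with BytesWritable values (the path, offset and types are assumptions): Reader.sync(pos) skips forward to the next sync marker, so reading never starts in the middle of a record.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class SyncMarkerDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);                   // an existing sequence file
    long arbitraryOffset = Long.parseLong(args[1]);  // e.g. somewhere mid-file

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    reader.sync(arbitraryOffset);   // skip forward to the next sync marker
    LongWritable key = new LongWritable();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {   // every read starts on a record boundary
      System.out.println("key=" + key.get() + ", value bytes=" + value.getLength());
    }
    reader.close();
  }
}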

> Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the number of mappers (e.g. ~502 out of 100 mappers, running small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…

Unfortunately I have no clue what you are talking about here. Looks like a key/val data issue to me by the sound of it. Perhaps bad partitioning/grouping is happening as a result of that.

P.s. If its better that way for you, you can also contact me off-list.


Re: Mappers and Reducer not being called, but no errors indicated - resolved :-)

Posted by Andy Doddington <an...@doddington.net>.
Oh dear, I feel such a fool. However, in the spirit of knowledge-sharing I thought I’d pass back my results (I hate it
when I find a thread where somebody has exactly the same problem I’m having and they then just close it by saying
they’ve fixed it, without saying *how*).

It seems that my problems were down to threading issues in my mappers, pretty much as I’d surmised. I’d confused
myself by thinking that it was down to some devious subtlety of Hadoop, when in fact it was just good old-fashioned
threading and non-thread-safe classes. Fixed by creating a new instance for each mapper. I have an outstanding
problem wrt reducers, but I think I can sort that myself, based on all of the Hadoop research I’ve done in the past
few weeks :-) Clouds and silver linings, eh?
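
For anyone who lands here with the same symptom, this is roughly what the fix described above looks like in the old mapred API. NonThreadSafeSolver is a stand-in for whatever third-party class is not thread-safe, and the key/value types (LongWritable key, BytesWritable blob, DoubleWritable result) are assumptions based on the thread.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MinMapper extends MapReduceBase
    implements Mapper<LongWritable, BytesWritable, LongWritable, DoubleWritable> {

  /** Stand-in for the real, non-thread-safe third-party class. */
  static class NonThreadSafeSolver {
    double solve(long key, byte[] data, int length) {
      return key + length;  // placeholder computation only
    }
  }

  // One instance per mapper object, created in configure() -- never shared statically.
  private NonThreadSafeSolver solver;

  @Override
  public void configure(JobConf job) {
    solver = new NonThreadSafeSolver();  // fresh instance for this task attempt
  }

  @Override
  public void map(LongWritable key, BytesWritable value,
                  OutputCollector<LongWritable, DoubleWritable> output,
                  Reporter reporter) throws IOException {
    double result = solver.solve(key.get(), value.getBytes(), value.getLength());
    // Everything is collected under one key so a single reduce group sees all values.
    output.collect(new LongWritable(0), new DoubleWritable(result));
  }
}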

Thanks to everyone who helped me on this though - hopefully one day I’ll be able to return the favour.

Cheers,

	Andy D

On 15 Nov 2011, at 18:15, Mathijs Homminga wrote:

> (see below)
> 
> Mathijs Homminga
> 
> On Nov 15, 2011, at 18:51, Andy Doddington <an...@doddington.net> wrote:
> 
>> Unfortunately, changing the data to LongWritable and Text would change the nature of the problem to such an extent that
>> any results would be meaningless :-(
> 
> Yes, I understand. Is it possible for you to post some code? The job setup lines for example, and some configuration?
> 
>> I did try changing everything to use SequenceFileAsBinary, but apart from causing more
>> aggravation in having to convert back and forth between BinaryWritable it made no difference to my problem - i.e. it still failed
>> in exactly the same way as before.
>> Incidentally, I’m assuming your second paragraph should read “...you should *not* worry about splits…”.
> 
> Yes, sorry about that.
> 
>> I’m coming to the conclusion that the problem must be due to multi-threading issues in some of the third-party libraries that
>> I am using, so my next plan of attack is to look at what threading options I can configure in hadoop. As always, any pointers
>> in this direction would be really appreciated :-)
> 
> Any insights or luck when you try a different number of mappers or input files? With just one? 
> Just playing trial-and-error here, let me know if you've done this all before..
> 
>> Cheers,
>> 
>>   Andy D
>> 
>> ———————————————
>> 
>> On 15 Nov 2011, at 14:07, Mathijs Homminga wrote:
>> 
>>> Can you reproduce this behavior with more simple SequenceFiles, which contain for example <LongWritable, Text> pairs?
>>> (I know, you have to adjust your mapper and reducer).
>>> 
>>> In general: when you InputFormat (and RecordReader) are properly configured/written, you should worry about splits in the middle of a binary object.
>>> 
>>> Mathijs
>>> 
>>> 
>>> On Nov 15, 2011, at 14:35 , Andy Doddington wrote:
>>> 
>>>> Sigh… still no success and I’m tearing my hair out :-(
>>>> 
>>>> One thought I’ve had is whether I’d be advised to use the SequenceFileAsBinary input and output classes? I’m not entirely
>>>> clear on how these differ from the ‘normal’ SequenceFile classes, since these already claim to be able to support binary data,
>>>> but at the moment I’m clutching at straws.
>>>> 
>>>> I did try changing to these but then got loads of exceptions claiming that I was trying to cast LongWritable (my key class)
>>>> and my value class to BytesWritable, which is what the SequenceFileAsBinary classes use, I believe.
>>>> 
>>>> If possible, could somebody indicate whether this change is worthwhile and, if so, how I can migrate my code to use BytesWritable
>>>> instead of LongWritable etc?
>>>> 
>>>> Thanks in anticipation,
>>>> 
>>>>   Andy Doddington
>>>> 
>>>> On 10 Nov 2011, at 16:33, Andy Doddington wrote:
>>>> 
>>>>> Thanks for your kind words - it still feels like pulling teeth at times :-(
>>>>> 
>>>>> Following on from your comments, here are a few more questions - hope you don’t find them too dumb…
>>>>> 
>>>>> 1) How does each mapper ‘know’ which file name to associate itself with?
>>>>> 2) Is it important that I name my files part<n> or will any unique name suffice?
>>>>> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?
>>>>> 
>>>>> Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the number of mappers (e.g. ~502 out of 100 mappers, running small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…
>>>>> 
>>>>> Regards,
>>>>> 
>>>>>   Andy D
>>>>> 
>>>>> ——————————————
>>>>> 
>>>>> On 10 Nov 2011, at 15:17, Harsh J wrote:
>>>>> 
>>>>>> Hey Andy,
>>>>>> 
>>>>>> You seem to be making good progress already! Some comments inline.
>>>>>> 
>>>>>> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
>>>>>> 
>>>>>>> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
>>>>>>> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
>>>>>>> 
>>>>>>> On a more positive note, I've found a few issues which have moved me forward a bit:
>>>>>>> 
>>>>>>> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
>>>>>>> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
>>>>>>> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?
>>>>>> 
>>>>>> Combiners would work on sorted map outputs. That is, after they are already partitioned out.
>>>>>> 
>>>>>>> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)
>>>>>> 
>>>>>> Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).
>>>>>> 
>>>>>> To add to the complexity now, is that HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits at the expense of losing some data locality.
>>>>>> 
>>>>>> Is this the explanation you were looking for?
>>>>>> 
>>>>>> And yes, looks like you're headed in the right direction already :)
>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>>   Andy D
>>>>>>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>>>>>>> 
>>>>>>>> Hey Andy,
>>>>>>>> 
>>>>>>>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>>>>>>>> 
>>>>>>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>>>>>>>> 
>>>>>>>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>>>>>>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>>>>>>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>>>>>>>> 
>>>>>>>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>>>>>>>> 
>>>>>>>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>>>>>>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>>>>>>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>>>>>>>> 
>>>>>>>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>>>>>>>> 
>>>>>>>>> I am using the Cloudera CDH3u1 distribution.
>>>>>>>>> 
>>>>>>>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>>>>>>>> 
>>>>>>>>> Thanks in anticipation,
>>>>>>>>> 
>>>>>>>>>   Andy Doddington
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Mathijs Homminga <ma...@knowlogy.nl>.
(see below)

Mathijs Homminga

On Nov 15, 2011, at 18:51, Andy Doddington <an...@doddington.net> wrote:

> Unfortunately, changing the data to LongWritable and Text would change the nature of the problem to such an extent that
> any results would be meaningless :-(

Yes, I understand. Is it possible for you to post some code? The job setup lines for example, and some configuration?

> I did try changing everything to use SequenceFileAsBinary, but apart from causing more
> aggravation in having to convert back and forth between BinaryWritable it made no difference to my problem - i.e. it still failed
> in exactly the same way as before.
> Incidentally, I’m assuming your second paragraph should read “...you should *not* worry about splits…”.

Yes, sorry about that.

> I’m coming to the conclusion that the problem must be due to multi-threading issues in some of the third-party libraries that
> I am using, so my next plan of attack is to look at what threading options I can configure in hadoop. As always, any pointers
> in this direction would be really appreciated :-)

Any insights or luck when you try a different number of mappers or input files? With just one? 
Just playing trial-and-error here, let me know if you've done this all before..

> Cheers,
> 
>    Andy D
> 
> ———————————————
> 
> On 15 Nov 2011, at 14:07, Mathijs Homminga wrote:
> 
>> Can you reproduce this behavior with more simple SequenceFiles, which contain for example <LongWritable, Text> pairs?
>> (I know, you have to adjust your mapper and reducer).
>> 
>> In general: when you InputFormat (and RecordReader) are properly configured/written, you should worry about splits in the middle of a binary object.
>> 
>> Mathijs
>> 
>> 
>> On Nov 15, 2011, at 14:35 , Andy Doddington wrote:
>> 
>>> Sigh… still no success and I’m tearing my hair out :-(
>>> 
>>> One thought I’ve had is whether I’d be advised to use the SequenceFileAsBinary input and output classes? I’m not entirely
>>> clear on how these differ from the ‘normal’ SequenceFile classes, since these already claim to be able to support binary data,
>>> but at the moment I’m clutching at straws.
>>> 
>>> I did try changing to these but then got loads of exceptions claiming that I was trying to cast LongWritable (my key class)
>>> and my value class to BytesWritable, which is what the SequenceFileAsBinary classes use, I believe.
>>> 
>>> If possible, could somebody indicate whether this change is worthwhile and, if so, how I can migrate my code to use BytesWritable
>>> instead of LongWritable etc?
>>> 
>>> Thanks in anticipation,
>>> 
>>>    Andy Doddington
>>> 
>>> On 10 Nov 2011, at 16:33, Andy Doddington wrote:
>>> 
>>>> Thanks for your kind words - it still feels like pulling teeth at times :-(
>>>> 
>>>> Following on from your comments, here are a few more questions - hope you don’t find them too dumb…
>>>> 
>>>> 1) How does each mapper ‘know’ which file name to associate itself with?
>>>> 2) Is it important that I name my files part<n> or will any unique name suffice?
>>>> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?
>>>> 
>>>> Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the number of mappers (e.g. ~502 out of 100 mappers, running small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…
>>>> 
>>>> Regards,
>>>> 
>>>>    Andy D
>>>> 
>>>> ——————————————
>>>> 
>>>> On 10 Nov 2011, at 15:17, Harsh J wrote:
>>>> 
>>>>> Hey Andy,
>>>>> 
>>>>> You seem to be making good progress already! Some comments inline.
>>>>> 
>>>>> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
>>>>> 
>>>>>> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
>>>>>> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
>>>>>> 
>>>>>> On a more positive note, I've found a few issues which have moved me forward a bit:
>>>>>> 
>>>>>> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
>>>>>> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
>>>>>> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?
>>>>> 
>>>>> Combiners would work on sorted map outputs. That is, after they are already partitioned out.
>>>>> 
>>>>>> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)
>>>>> 
>>>>> Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).
>>>>> 
>>>>> To add to the complexity now, is that HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits at the expense of losing some data locality.
>>>>> 
>>>>> Is this the explanation you were looking for?
>>>>> 
>>>>> And yes, looks like you're headed in the right direction already :)
>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>>    Andy D
>>>>>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>>>>>> 
>>>>>>> Hey Andy,
>>>>>>> 
>>>>>>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>>>>>>> 
>>>>>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>>>>>>> 
>>>>>>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>>>>>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>>>>>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>>>>>>> 
>>>>>>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>>>>>>> 
>>>>>>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>>>>>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>>>>>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>>>>>>> 
>>>>>>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>>>>>>> 
>>>>>>>> I am using the Cloudera CDH3u1 distribution.
>>>>>>>> 
>>>>>>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>>>>>>> 
>>>>>>>> Thanks in anticipation,
>>>>>>>> 
>>>>>>>>    Andy Doddington
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

Re: Mappers and Reducer not being called, but no errors indicated

Posted by Andy Doddington <an...@doddington.net>.
Unfortunately, changing the data to LongWritable and Text would change the nature of the problem to such an extent that
any results would be meaningless :-( I did try changing everything to use SequenceFileAsBinary, but apart from causing more
aggravation in having to convert back and forth to BytesWritable, it made no difference to my problem - i.e. it still failed
in exactly the same way as before.

Incidentally, I’m assuming your second paragraph should read “...you should *not* worry about splits…”.

I’m coming to the conclusion that the problem must be due to multi-threading issues in some of the third-party libraries that
I am using, so my next plan of attack is to look at what threading options I can configure in Hadoop. As always, any pointers
in this direction would be really appreciated :-)
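
On the "threading options" point, the main knob in the old mapred API appears to be the map runner; by default a single thread calls map() for every record, and multiple threads inside one map task only appear if MultithreadedMapRunner has been configured. A hedged sketch (the property name is remembered from 0.20-era Hadoop and worth verifying against CDH3):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunner;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class ThreadingOptions {
  public static void configure(JobConf conf, boolean multithreaded) {
    if (multithreaded) {
      // Opt-in: several threads call map() concurrently inside one task JVM.
      conf.setMapRunnerClass(MultithreadedMapRunner.class);
      // Property name as of 0.20-era Hadoop -- please double-check.
      conf.setInt("mapred.map.multithreadedrunner.threads", 4);
    } else {
      // The default: a single thread calls map() for every record in the split.
      conf.setMapRunnerClass(MapRunner.class);
    }
  }
}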

Cheers,

	Andy D

———————————————

On 15 Nov 2011, at 14:07, Mathijs Homminga wrote:

> Can you reproduce this behavior with more simple SequenceFiles, which contain for example <LongWritable, Text> pairs?
> (I know, you have to adjust your mapper and reducer).
> 
> In general: when you InputFormat (and RecordReader) are properly configured/written, you should worry about splits in the middle of a binary object.
> 
> Mathijs
> 
> 
> On Nov 15, 2011, at 14:35 , Andy Doddington wrote:
> 
>> Sigh… still no success and I’m tearing my hair out :-(
>> 
>> One thought I’ve had is whether I’d be advised to use the SequenceFileAsBinary input and output classes? I’m not entirely
>> clear on how these differ from the ‘normal’ SequenceFile classes, since these already claim to be able to support binary data,
>> but at the moment I’m clutching at straws.
>> 
>> I did try changing to these but then got loads of exceptions claiming that I was trying to cast LongWritable (my key class)
>> and my value class to BytesWritable, which is what the SequenceFileAsBinary classes use, I believe.
>> 
>> If possible, could somebody indicate whether this change is worthwhile and, if so, how I can migrate my code to use BytesWritable
>> instead of LongWritable etc?
>> 
>> Thanks in anticipation,
>> 
>> 	Andy Doddington
>> 
>> On 10 Nov 2011, at 16:33, Andy Doddington wrote:
>> 
>>> Thanks for your kind words - it still feels like pulling teeth at times :-(
>>> 
>>> Following on from your comments, here are a few more questions - hope you don’t find them too dumb…
>>> 
>>> 1) How does each mapper ‘know’ which file name to associate itself with?
>>> 2) Is it important that I name my files part<n> or will any unique name suffice?
>>> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?
>>> 
>>> Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the number of mappers (e.g. ~502 out of 100 mappers, running small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…
>>> 
>>> Regards,
>>> 
>>> 	Andy D
>>> 
>>> ——————————————
>>> 
>>> On 10 Nov 2011, at 15:17, Harsh J wrote:
>>> 
>>>> Hey Andy,
>>>> 
>>>> You seem to be making good progress already! Some comments inline.
>>>> 
>>>> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
>>>> 
>>>>> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
>>>>> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
>>>>> 
>>>>> On a more positive note, I've found a few issues which have moved me forward a bit:
>>>>> 
>>>>> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
>>>>> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
>>>>> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?
>>>> 
>>>> Combiners would work on sorted map outputs. That is, after they are already partitioned out.
>>>> 
>>>>> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)
>>>> 
>>>> Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).
>>>> 
>>>> To add to the complexity now, is that HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits at the expense of losing some data locality.
>>>> 
>>>> Is this the explanation you were looking for?
>>>> 
>>>> And yes, looks like you're headed in the right direction already :)
>>>> 
>>>>> Regards,
>>>>> 
>>>>> 	Andy D
>>>>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>>>>> 
>>>>>> Hey Andy,
>>>>>> 
>>>>>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>>>>>> 
>>>>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>>>>>> 
>>>>>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>>>>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>>>>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>>>>>> 
>>>>>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>>>>>> 
>>>>>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>>>>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>>>>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>>>>>> 
>>>>>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>>>>>> 
>>>>>>> I am using the Cloudera CDH3u1 distribution.
>>>>>>> 
>>>>>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>>>>>> 
>>>>>>> Thanks in anticipation,
>>>>>>> 
>>>>>>> 	Andy Doddington
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Mathijs Homminga <ma...@knowlogy.nl>.
Can you reproduce this behavior with simpler SequenceFiles, which contain for example <LongWritable, Text> pairs?
(I know, you have to adjust your mapper and reducer).

In general: when your InputFormat (and RecordReader) are properly configured/written, you should worry about splits in the middle of a binary object.

Mathijs


On Nov 15, 2011, at 14:35 , Andy Doddington wrote:

> Sigh… still no success and I’m tearing my hair out :-(
> 
> One thought I’ve had is whether I’d be advised to use the SequenceFileAsBinary input and output classes? I’m not entirely
> clear on how these differ from the ‘normal’ SequenceFile classes, since these already claim to be able to support binary data,
> but at the moment I’m clutching at straws.
> 
> I did try changing to these but then got loads of exceptions claiming that I was trying to cast LongWritable (my key class)
> and my value class to BytesWritable, which is what the SequenceFileAsBinary classes use, I believe.
> 
> If possible, could somebody indicate whether this change is worthwhile and, if so, how I can migrate my code to use BytesWritable
> instead of LongWritable etc?
> 
> Thanks in anticipation,
> 
> 	Andy Doddington
> 
> On 10 Nov 2011, at 16:33, Andy Doddington wrote:
> 
>> Thanks for your kind words - it still feels like pulling teeth at times :-(
>> 
>> Following on from your comments, here are a few more questions - hope you don’t find them too dumb…
>> 
>> 1) How does each mapper ‘know’ which file name to associate itself with?
>> 2) Is it important that I name my files part<n> or will any unique name suffice?
>> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?
>> 
>> Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the number of mappers (e.g. ~502 out of 100 mappers, running small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…
>> 
>> Regards,
>> 
>> 	Andy D
>> 
>> ——————————————
>> 
>> On 10 Nov 2011, at 15:17, Harsh J wrote:
>> 
>>> Hey Andy,
>>> 
>>> You seem to be making good progress already! Some comments inline.
>>> 
>>> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
>>> 
>>>> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
>>>> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
>>>> 
>>>> On a more positive note, I've found a few issues which have moved me forward a bit:
>>>> 
>>>> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
>>>> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
>>>> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?
>>> 
>>> Combiners would work on sorted map outputs. That is, after they are already partitioned out.
>>> 
>>>> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)
>>> 
>>> Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).
>>> 
>>> To add to the complexity now, is that HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits at the expense of losing some data locality.
>>> 
>>> Is this the explanation you were looking for?
>>> 
>>> And yes, looks like you're headed in the right direction already :)
>>> 
>>>> Regards,
>>>> 
>>>> 	Andy D
>>>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>>>> 
>>>>> Hey Andy,
>>>>> 
>>>>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>>>>> 
>>>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>>>>> 
>>>>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>>>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>>>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>>>>> 
>>>>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>>>>> 
>>>>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>>>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>>>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>>>>> 
>>>>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>>>>> 
>>>>>> I am using the Cloudera CDH3u1 distribution.
>>>>>> 
>>>>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>>>>> 
>>>>>> Thanks in anticipation,
>>>>>> 
>>>>>> 	Andy Doddington
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Andy Doddington <an...@doddington.net>.
Sigh… still no success and I’m tearing my hair out :-(

One thought I’ve had is whether I’d be advised to use the SequenceFileAsBinary input and output classes? I’m not entirely
clear on how these differ from the ‘normal’ SequenceFile classes, since these already claim to be able to support binary data,
but at the moment I’m clutching at straws.

I did try changing to these but then got loads of exceptions claiming that I was trying to cast LongWritable (my key class)
and my value class to BytesWritable, which is what the SequenceFileAsBinary classes use, I believe.

If possible, could somebody indicate whether this change is worthwhile and, if so, how I can migrate my code to use BytesWritable
instead of LongWritable etc?
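
In case it helps later readers: with SequenceFileAsBinaryInputFormat (old mapred API), both key and value arrive in the mapper as BytesWritable holding the raw serialised bytes, and the original Writables have to be rebuilt by hand. A minimal sketch of that decoding step, assuming the original key really was a LongWritable:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.LongWritable;

public class BinaryKeyDecoder {
  /** Rebuild the original LongWritable key from the raw serialised bytes. */
  public static LongWritable decodeKey(BytesWritable raw) throws IOException {
    DataInputBuffer in = new DataInputBuffer();
    in.reset(raw.getBytes(), raw.getLength());
    LongWritable key = new LongWritable();
    key.readFields(in);   // deserialise in the normal Writable fashion
    return key;
  }
}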

Thanks in anticipation,

	Andy Doddington

On 10 Nov 2011, at 16:33, Andy Doddington wrote:

> Thanks for your kind words - it still feels like pulling teeth at times :-(
> 
> Following on from your comments, here are a few more questions - hope you don’t find them too dumb…
> 
> 1) How does each mapper ‘know’ which file name to associate itself with?
> 2) Is it important that I name my files part<n> or will any unique name suffice?
> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?
> 
> Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the number of mappers (e.g. ~502 out of 100 mappers, running small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…
> 
> Regards,
> 
> 	Andy D
> 
> ——————————————
> 
> On 10 Nov 2011, at 15:17, Harsh J wrote:
> 
>> Hey Andy,
>> 
>> You seem to be making good progress already! Some comments inline.
>> 
>> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
>> 
>>> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
>>> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
>>> 
>>> On a more positive note, I've found a few issues which have moved me forward a bit:
>>> 
>>> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
>>> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
>>> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?
>> 
>> Combiners would work on sorted map outputs. That is, after they are already partitioned out.
>> 
>>> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)
>> 
>> Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).
>> 
>> To add to the complexity now, is that HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits at the expense of losing some data locality.
>> 
>> Is this the explanation you were looking for?
>> 
>> And yes, looks like you're headed in the right direction already :)
>> 
>>> Regards,
>>> 
>>> 	Andy D
>>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>>> 
>>>> Hey Andy,
>>>> 
>>>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>>>> 
>>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>>>> 
>>>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>>>> 
>>>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>>>> 
>>>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>>>> 
>>>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>>>> 
>>>>> I am using the Cloudera CDH3u1 distribution.
>>>>> 
>>>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>>>> 
>>>>> Thanks in anticipation,
>>>>> 
>>>>> 	Andy Doddington
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Andy Doddington <an...@doddington.net>.
Thanks for your kind words - it still feels like pulling teeth at times :-(

Following on from your comments, here are a few more questions - hope you don’t find them too dumb…

1) How does each mapper ‘know’ which file name to associate itself with?
2) Is it important that I name my files part<n> or will any unique name suffice?
3) I’m using binary serialisation with Sequence files - are these ‘split’ across multiple mappers? What happens if the split occurs in the middle of a binary object?

Current state of play is that the mappers are being called the correct number of times and are generating the correct result for the first half of the mappers (e.g. ~50 out of 100 mappers, running a small test), but are then generating bad results after that. The reducer is then correctly selecting the minimum - it just happens to be a bad value due to the mapper problem. Ho hum…

Regards,

	Andy D

——————————————

On 10 Nov 2011, at 15:17, Harsh J wrote:

> Hey Andy,
> 
> You seem to be making good progress already! Some comments inline.
> 
> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
> 
>> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
>> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
>> 
>> On a more positive note, I've found a few issues which have moved me forward a bit:
>> 
>> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
>> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
>> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?
> 
> Combiners would work on sorted map outputs. That is, after they are already partitioned out.
> 
>> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)
> 
> Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).
> 
> To add to the complexity now, is that HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits at the expense of losing some data locality.
> 
> Is this the explanation you were looking for?
> 
> And yes, looks like you're headed in the right direction already :)
> 
>> Regards,
>> 
>> 	Andy D
>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>> 
>>> Hey Andy,
>>> 
>>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>>> 
>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>>> 
>>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>>> 
>>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>>> 
>>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>>> 
>>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>>> 
>>>> I am using the Cloudera CDH3u1 distribution.
>>>> 
>>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>>> 
>>>> Thanks in anticipation,
>>>> 
>>>> 	Andy Doddington
>>>> 
>>> 
>> 
> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Harsh J <ha...@cloudera.com>.
Hey Andy,

You seem to be making good progress already! Some comments inline.

On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:

> Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
> foolish/uncooperative, but I hope you understand there’s little I can do about it :-(
> 
> On a more positive note, I've found a few issues which have moved me forward a bit:
> 
> I first noticed that the PiEstimator used files named part<n> to transfer data to each of the Mappers - I had changed this name to be something more meaningful to my app. I am aware that Hadoop uses some files that are similarly named, and hoped that this might be the cause. Sadly, this fix made no difference.
> While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducers was getting called once for each mapper - which is not exactly optimal in my case.
> I therefore removed the jobConf call which I had made to set my reducer to also be the combiner - and suddenly the results started looking a lot healthier - although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums of a series of subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. Will investigate the doc’n on this a bit more. Maybe some subtle interaction wrt combiners and partitioners?

Combiners would work on sorted map outputs. That is, after they are already partitioned out.
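
For reference, a minimal old-API min reducer of the kind being discussed; DoubleWritable is an assumption, since the thread never shows the real value class. Because taking a minimum is associative and commutative, a class like this can in principle be registered as both the reducer (conf.setReducerClass) and the combiner (conf.setCombinerClass).

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MinReducer extends MapReduceBase
    implements Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {

  @Override
  public void reduce(LongWritable key, Iterator<DoubleWritable> values,
                     OutputCollector<LongWritable, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    double min = Double.POSITIVE_INFINITY;
    while (values.hasNext()) {
      min = Math.min(min, values.next().get());  // keep the smallest value seen
    }
    output.collect(key, new DoubleWritable(min));
  }
}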

> I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite lack of log output)  then I’d be more than happy to hear from you :-)

Naively describing, one file would go to one mapper. Each mapper invocation (map IDs -- 0,1,2,…) would have a file name to associate itself with, and it would begin reading that file off the DFS using a record-reader and begin calling map() on each record read (lines, in the most common case).

To add to the complexity, HDFS stores files as blocks, and hence you may have multiple mappers for a single file, working on different offsets (0-mid, mid-len for a simple 2-block split, say). This is configurable though -- you can choose not to have data input splits, at the expense of losing some data locality.
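
The "choose not to have data input splits" option can look like this minimal sketch (old mapred API, key/value types assumed): subclass the input format and report files as non-splittable.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class NonSplitSequenceFileInputFormat
    extends SequenceFileInputFormat<LongWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // each input file becomes exactly one split, hence one mapper
  }
}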

Is this the explanation you were looking for?

And yes, looks like you're headed in the right direction already :)

> Regards,
> 
> 	Andy D
> On 10 Nov 2011, at 11:52, Harsh J wrote:
> 
>> Hey Andy,
>> 
>> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
>> 
>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>> 
>>> Hi,
>>> 
>>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>>> 
>>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>>> 
>>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>>> 
>>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>>> 
>>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>>> 
>>> I am using the Cloudera CDH3u1 distribution.
>>> 
>>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>>> 
>>> Thanks in anticipation,
>>> 
>>> 	Andy Doddington
>>> 
>> 
> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Andy Doddington <an...@doddington.net>.
Unfortunately my employer blocks any attempt to transfer data outside of the company - I realise this makes me look pretty
foolish/uncooperative, but I hope you understand there’s little I can do about it :-(

On a more positive note, I've found a few issues which have moved me forward a bit:

I first noticed that the PiEstimator example used files named part<n> to transfer data to each of the mappers - I had changed this name to something more meaningful for my app. Since Hadoop itself uses similarly named files, I hoped the clash might be the cause. Sadly, this change made no difference.
While looking at this area of the code, I realised that although I was writing data to these files, I was failing to close them! This fix did make a difference, in that the mappers now actually appear to be getting called. However, the final result from the reduce was still incorrect. What seemed to be happening (based on the mapper logs) was that the reducer was getting called once for each mapper - which is not exactly optimal in my case.
I therefore removed the jobConf call that set my reducer to also be the combiner - and suddenly the results started looking a lot healthier, although they are still not 100% correct. I had naively assumed that the minimum of a set of minimums taken over subsets of the data would be the same as the minimum of the entire set, but I’ve clearly misunderstood how combiners work. I will investigate the documentation on this a bit more. Maybe there is some subtle interaction between combiners and partitioners?

I’m still confused as to how the mappers get passed the data that I put into the part<n> files, but I *think* I’m now heading in the right direction. If you can see the cause of my problems (despite the lack of log output) then I’d be more than happy to hear from you :-)

Regards,

	Andy D
On 10 Nov 2011, at 11:52, Harsh J wrote:

> Hey Andy,
> 
> Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?
> 
> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
> 
>> Hi,
>> 
>> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
>> 
>> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
>> 2) Each of my map tasks reads the input and generates a pair of floating point values.
>> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
>> 
>> Unfortunately, this is not working, but is exhibiting the following symptoms:
>> 
>> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
>> I only ever get a single part-00000 file created, regardless of how many maps I specify.
>> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
>> 
>> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
>> 
>> I am using the Cloudera CDH3u1 distribution.
>> 
>> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
>> 
>> Thanks in anticipation,
>> 
>> 	Andy Doddington
>> 
> 


Re: Mappers and Reducer not being called, but no errors indicated

Posted by Harsh J <ha...@cloudera.com>.
Hey Andy,

Can you pastebin the whole runlog of your job after you invoke it via 'hadoop jar'/etc.?

On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:

> Hi,
> 
> I have written a fairly straightforward Hadoop program, modelled after the PiEstimator example which is shipped with the distro.
> 
> 1) I write a series of files to HDFS, each containing the input for a single map task. This amounts to around 20Mb per task.
> 2) Each of my map tasks reads the input and generates a pair of floating point values.
> 3) My reduce task scans the list of floating point values produced by the maps and returns the minimum.
> 
> Unfortunately, this is not working, but is exhibiting the following symptoms:
> 
> Based on log output, I have no evidence that the mappers are actually being called, although the 'percentage complete’ output seems to go down slowly as might be expected if they were being called.
> I only ever get a single part-00000 file created, regardless of how many maps I specify.
> In the case of my reducer, although its constructor, ‘setConf' and ‘close' methods are called (based on log output), its reduce method never gets called.
> 
> I have checked the visibility of all classes and confirmed that the methods signatures are correct (as confirmed by Eclipse and use of the @Override annotation), and I’m at my wits end. To further add to my suffering, the log outputs do not show any errors :-(
> 
> I am using the Cloudera CDH3u1 distribution.
> 
> As a final query, could somebody explain how it is that the multiple files I create get associated with the various map tasks? This part is a mystery to me (and might even be the underlying source of my problems).
> 
> Thanks in anticipation,
> 
> 	Andy Doddington
>