Posted to user@mahout.apache.org by robpd <Ro...@yahoo.co.uk> on 2011/10/21 09:46:13 UTC

Could someone please explain the "part-m-00000" thing?

I am very new to Mahout and trying to learn about it. I managed to get the
MeanShiftClusterer to work and print out my output using SequenceFile.Reader
with a small number of points. I'm using a pseudo-distributed configuration
under cygwin on my laptop.

Although it worked I really do not understand the reason for the
'part-m-00000' required in the path of the reader. From what I have read
the 'm' stands for 'map' and the 00000 means it's the first map.
Apparently there can also be a 'part-r-00000' for the first reduce. Is that
correct? Here's where my confusion starts.

1) Why is there not a 'part-r-00000' present after I run my code? Surely the
finished clusters should have been subject to a reduce after the maps?

2) Does this mean that my clusterer only did the map, but not the reduce (so
is not correct)?

3) If I had a proper distributed hardware setup would I also find that there
were 'part-m-00001', 'part-m-00002', 'part-m-0000n'? So would I need to read
them all or would there be one or more 'part-r-000s'?

4) The bottom line is I WANT TO READ ALL CLUSTERS irrespective of the number
of hardware nodes and I want them to be fully 'reduced'. Given my confusion
over the above, what's the best way of doing this? Is there any sample code
out there to do this?

Any help would be gratefully received.

Rob

--
View this message in context: http://lucene.472066.n3.nabble.com/Could-someone-please-explain-the-part-m-00000-thing-tp3440174p3440174.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Could someone please explain the "part-m-00000" thing?

Posted by Sean Owen <sr...@gmail.com>.

To be honest, I am not sure what the "m" in part-m-00000 means, but I
do not think it's just for map output. This is just Hadoop's naming
convention for output. This is the normal (reduce) output.

Yes, you need to be prepared to read all part-m-* files. In Hadoop,
you can never expect anything is in just one file.

See SequenceFileDirIterable for an easy way to iterate over all
records in all files.
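
As a rough illustration of "never expect just one file", the sketch below collects every part-* file in an output directory instead of hard-coding part-m-00000. It uses only the JDK and a local directory, so it is not the Mahout approach itself: the class and method names (PartFileLister, listPartFiles) are made up for this example, and on HDFS you would go through Hadoop's FileSystem API or SequenceFileDirIterable as suggested above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Mahout.
class PartFileLister {

    // Returns every regular file named "part-*" directly under outputDir,
    // sorted by path. Entries such as _SUCCESS do not match the glob.
    static List<Path> listPartFiles(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the job output has been copied to a local directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listPartFiles(dir)) {
            System.out.println(p);
        }
    }
}
```

Each path returned would then be opened and read in turn, so the reading code does not care whether the job produced one part file or fifty.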

Sean

Re: Could someone please explain the "part-m-00000" thing?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

In Hadoop 0.20.2, output was written only in the form part-00000. At
some point during the 0.21 work, Hadoop started to distinguish between output
produced by map-only jobs and output produced by full MR jobs. I am not sure
exactly when that happened, but CDH3 releases already produce output with m
and r in the middle.

There is always only one set of files produced by the standard output formats
per job. The only distinction is that files with m are the output of map-only
jobs and each represents the output of an individual map task (not the same
as a task attempt), whereas files with r each represent the output of an
individual reduce task in a full-fledged map-reduce job.
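
The naming scheme described above can be sketched as a small parser. This is only an illustration of the convention as explained in this thread; the class name PartFileName, the method describe and the wording of its return values are invented for the example and are not a Hadoop API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop or Mahout.
class PartFileName {

    // Matches part-00000 (older naming), part-m-00000 and part-r-00000.
    private static final Pattern NAME = Pattern.compile("part-(?:([mr])-)?(\\d+)");

    // Describes which kind of task wrote the file, based on its name alone.
    static String describe(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return "not a part file";
        }
        int task = Integer.parseInt(m.group(2)); // "00012" -> 12
        String kind = m.group(1);                // "m", "r", or null
        if (kind == null) {
            return "task " + task + " (older naming, no m/r marker)";
        }
        if (kind.equals("m")) {
            return "map task " + task + " of a map-only job";
        }
        return "reduce task " + task + " of a full map-reduce job";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"part-m-00000", "part-r-00003", "part-00001"}) {
            System.out.println(name + ": " + describe(name));
        }
    }
}
```

Read this way, a directory that contains only part-m-* files says the job had no reduce phase at all, not that a reduce step was skipped by mistake.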

The number of map tasks in most default scenarios is determined by the number
of splits in your input, which in most scenarios translates into the number of
individual HDFS blocks taken up by the input.
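
As a back-of-the-envelope sketch of that rule, the number of map tasks (and so of part-m-* files from a map-only job) can be estimated by rounding the input size up to whole blocks. This assumes a single input file and one split per HDFS block, which is only the common default case; the class and method names are hypothetical, and the 64 MB block size in the demo is just an example value.

```java
// Hypothetical helper, not part of Hadoop or Mahout.
class MapTaskEstimate {

    // Estimates the number of map tasks for one input file, assuming one
    // input split per HDFS block (the usual default, but not guaranteed).
    static long estimateMapTasks(long inputBytes, long blockBytes) {
        if (inputBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Ceiling division: a partly filled last block still needs a task.
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 64 * mb; // example block size, an assumption
        System.out.println(estimateMapTasks(200 * mb, block));
        System.out.println(estimateMapTasks(10 * 1024L, block));
    }
}
```

Under these assumptions a small test input fits in one block, which would explain seeing only part-m-00000 on a laptop setup.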

The number of reduce tasks, on the other hand, is meant to be set explicitly
and depends on the available capacity of the cluster. For maximum throughput
(assuming you have a sizeable task to do), the recommended number of
reducers is one that fills the entire cluster capacity minus a small margin
(~5% or so) to allow for opportunistic execution.
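
A minimal sketch of that sizing rule, taking "cluster capacity" to mean nodes times reduce slots per node and the margin to be exactly 5%. Both are assumptions made for the example, as are the class and method names; the result is the kind of value you would then set explicitly on the job, for instance through Job.setNumReduceTasks.

```java
// Hypothetical helper, not part of Hadoop or Mahout.
class ReducerCount {

    // Suggests a reducer count that leaves roughly 5% of the reduce
    // capacity free. Integer arithmetic rounds down; the minimum is 1.
    static int suggestedReducers(int nodes, int reduceSlotsPerNode) {
        if (nodes <= 0 || reduceSlotsPerNode <= 0) {
            throw new IllegalArgumentException("cluster must have capacity");
        }
        int capacity = nodes * reduceSlotsPerNode;
        return Math.max(1, capacity * 95 / 100);
    }

    public static void main(String[] args) {
        // Example cluster sizes are made up for illustration.
        System.out.println(suggestedReducers(10, 2));
        System.out.println(suggestedReducers(1, 2));
    }
}
```

A full map-reduce job configured with that many reducers would write the same number of part-r-* files, one per reduce task.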