Posted to common-user@hadoop.apache.org by Mark Kerzner <ma...@gmail.com> on 2011/09/07 03:42:23 UTC

Too many maps?

Hi,

I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open source
tool for eDiscovery, and I am using the Enron data set
<http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2> for
that. In my processing, each email with its attachments becomes a map, which
is later collected by a reducer and written to the output. With (PST)
mailboxes of around 2-5 Gigs, I begin to see email counts of about 50,000. I
remember from the Yahoo best practices that the number of maps should not
exceed 75,000, and I can see that I will break this barrier soon.

I could, potentially, combine a few emails into one map, but I would be
doing it only to circumvent the size problem, not because my processing
requires it. Besides, my keys are the MD5 hashes of the files, and I use
them to find duplicates. If I combine a few emails into a map, I cannot use
the hashes as keys in a meaningful way anymore.

So my question is: can't I have millions of maps, if that's how many
artifacts I need to process? And if not, why not?

Thank you. Sincerely,
Mark

Re: Too many maps?

Posted by Mark Kerzner <ma...@gmail.com>.
Thank you, Sonal,

at least that big job I was looking at just finished :)

Mark

Re: Too many maps?

Posted by Sonal Goyal <so...@gmail.com>.
Mark,

Having a large number of emitted key values from the mapper should not be a
problem. Just make sure that you have enough reducers to handle the data so
that the reduce stage does not become a bottleneck.
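
As a minimal sketch (the job name and the reducer count below are only
placeholders; the right count depends on the cluster's reduce capacity), the
number of reducers can be set on the Job before it is submitted:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        Configuration conf = new Configuration();
        // "freeeed-processing" is an illustrative job name
        Job job = new Job(conf, "freeeed-processing");
        // 20 is an assumed value; size it to the cluster's reduce slots
        job.setNumReduceTasks(20);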

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

Re: Too many maps?

Posted by Mark Kerzner <ma...@gmail.com>.
Harsh,

I read one PST file, which contains many emails. But then I emit many map
outputs, like this:

        MapWritable mapWritable = createMapWritable(metadata, fileName);
        // use MD5 of the input file as Hadoop key
        FileInputStream fileInputStream = new FileInputStream(fileName);
        MD5Hash key = MD5Hash.digest(fileInputStream);
        fileInputStream.close();
        // emit map
        context.write(key, mapWritable);

and it is these context.write calls that I make a great number of. Is that a
problem?
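
For context, the reducer side is where the MD5 keys pay off: all copies of a
file arrive under the same key, so duplicates can be collapsed there. A
minimal sketch (DedupReducer is an illustrative name, not the actual FreeEed
class) would be:

        import java.io.IOException;
        import org.apache.hadoop.io.MD5Hash;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.mapreduce.Reducer;

        // All emails with the same MD5 are grouped under one key, so emitting
        // only the first value per key keeps a single copy of each file.
        public class DedupReducer
                extends Reducer<MD5Hash, MapWritable, MD5Hash, MapWritable> {
            @Override
            protected void reduce(MD5Hash key, Iterable<MapWritable> values,
                    Context context) throws IOException, InterruptedException {
                for (MapWritable value : values) {
                    context.write(key, value); // first occurrence
                    break;                     // the rest are duplicates
                }
            }
        }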

Mark

Re: Too many maps?

Posted by Harsh J <ha...@cloudera.com>.
You can use an input format that lets you read multiple files per map
(like say, all local files. See CombineFileInputFormat for one
implementation that does this). This way you get reduced map #s and
you don't really have to clump your files. One record reader would be
initialized per file, so I believe you should be free to generate
unique identities per file/email with this approach (whenever a new
record reader is initialized)?
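
For illustration, a combining input format might look roughly like this. It
is only a sketch, assuming the org.apache.hadoop.mapreduce version of
CombineFileInputFormat is available in your Hadoop release; PstRecordReader
is a hypothetical per-file record reader you would write yourself (it needs
a (CombineFileSplit, TaskAttemptContext, Integer) constructor):

        import java.io.IOException;
        import org.apache.hadoop.io.MD5Hash;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
        import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

        // Packs many small files into each map task; CombineFileRecordReader
        // opens one PstRecordReader per file inside the combined split.
        public class CombinedPstInputFormat
                extends CombineFileInputFormat<MD5Hash, MapWritable> {

            public CombinedPstInputFormat() {
                setMaxSplitSize(128 * 1024 * 1024); // assumed 128 MB per combined split
            }

            @Override
            public RecordReader<MD5Hash, MapWritable> createRecordReader(
                    InputSplit split, TaskAttemptContext context) throws IOException {
                return new CombineFileRecordReader<MD5Hash, MapWritable>(
                        (CombineFileSplit) split, context, PstRecordReader.class);
            }
        }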

-- 
Harsh J