Posted to hdfs-user@hadoop.apache.org by Devin Suiter RDX <ds...@rdx.com> on 2013/12/30 19:29:49 UTC

MapReduce MIME Input type?

Hi,

I am trying to puzzle this out, and am hoping for some insight - I have an
IMAP inbox dump that I am analyzing - I need to track how many times a
given item is referred to in the inbox, i.e. how many emails came in about
that thing and over what time. I can load it into MapReduce as
TextInputFormat and parse it properly, and have managed to crudely
concatenate lines that represent an email together as my final output, so,
basically, it is working now, but my program is seeing each line as an
InputSplit, and so it is only working reliably with one InputFileSplit.
If I had a bigger file, with multiple InputFileSplits presenting
line-by-line InputSplits, I have no way to be sure that the lines that make
one email will not end up in two different splits - does that make sense?

Someone I work with suggested that I attempt to read each email as a
record, since they have their MIME encoding intact in the text dump, rather
than each line as a record.

Does anyone know of a MIME MapReduce input type? I can't be sure this will
help anyway, since the file is already text-encoded - I may have to get the
email from the original inbox as individual messages somehow to utilize the
MIME header information.

Googling this has been challenging, mainly because the words you have to
use are somewhat overloaded - but I am finding some good clown schools in
my research...so, any help is appreciated.

Thanks!
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
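
The split-boundary worry above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: it assumes the dump is mbox-style, with every message beginning on a line that starts with "From " (the thread does not say what the dump format actually is), and the class and method names are hypothetical. The rule it applies is the one line-oriented readers commonly use for lines: a split owns every message whose delimiter line starts inside the split, and the reader is allowed to run past the end of its split to finish the last message it owns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows how a reader can hand out
// whole messages even when a split boundary falls in the middle of one.
final class SplitAwareMessageReader {
    // Assumption: mbox-style separator at the start of a line.
    private static final String DELIM = "From ";

    /** Offset of the first delimiter line starting at or after 'from', or data.length() if none. */
    static int nextMessageStart(String data, int from) {
        int pos = from;
        while (pos < data.length()) {
            boolean atLineStart = pos == 0 || data.charAt(pos - 1) == '\n';
            if (atLineStart && data.startsWith(DELIM, pos)) {
                return pos;
            }
            // Not a delimiter here: jump to the start of the next line.
            int nl = data.indexOf('\n', pos);
            if (nl < 0) {
                return data.length();
            }
            pos = nl + 1;
        }
        return data.length();
    }

    /** Returns every message whose delimiter line begins in [start, end). */
    static List<String> readSplit(String data, int start, int end) {
        List<String> messages = new ArrayList<>();
        // Skip the tail of a message that belongs to the previous split.
        int msgStart = nextMessageStart(data, start);
        while (msgStart < end && msgStart < data.length()) {
            // Read past 'end' if needed so the last owned message is complete.
            int msgEnd = nextMessageStart(data, msgStart + DELIM.length());
            messages.add(data.substring(msgStart, msgEnd));
            msgStart = msgEnd;
        }
        return messages;
    }
}
```

Calling readSplit(data, 0, cut) and readSplit(data, cut, data.length()) for any cut point should return each message exactly once between the two calls. A real input format would work on byte offsets in a stream rather than on a String, and a body line that happens to start with "From " would be mistaken for a separator unless the dump escapes it.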

Re: MapReduce MIME Input type?

Posted by Devin Suiter RDX <ds...@rdx.com>.
Thanks Harsh! Looks like something that might be useful! I appreciate it!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Dec 31, 2013 at 1:08 AM, Harsh J <ha...@cloudera.com> wrote:

> Hey Devin,
>
> Are you perhaps looking for http://james.apache.org/mime4j/? You may have
> to adapt it for MR but I don't imagine that would be too difficult to do.
>
> On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX <ds...@rdx.com> wrote:
>
>> Hi,
>>
>> I am trying to puzzle this out, and am hoping for some insight - I have
>> an IMAP inbox dump that I am analyzing - I need to track how many times a
>> given item is referred to in the inbox, i.e. how many emails came in about
>> that thing and over what time. I can load it into MapReduce as
>> TextInputFormat and parse it properly, and have managed to crudely
>> concatenate lines that represent an email together as my final output, so,
>> basically, it is working now, but my program is seeing each line as an
>> InputSplit, and so it is only working reliably with one InputFileSplit.
>> If I had a bigger file, with multiple InputFileSplits presenting
>> line-by-line InputSplits, I have no way to be sure that the lines that make
>> one email will not end up in two different splits - does that make sense?
>>
>> Someone I work with suggested that I attempt to read each email as a
>> record, since they have their MIME encoding intact in the text dump, rather
>> than each line as a record.
>>
>> Does anyone know of a MIME MapReduce input type? I can't be sure this
>> will help anyway, since the file is already text-encoded - I may have to
>> get the email from the original inbox as individual messages somehow to
>> utilize the MIME header information.
>>
>> Googling this has been challenging, mainly because the words you have to
>> use are somewhat overloaded - but I am finding some good clown schools in
>> my research...so, any help is appreciated.
>>
>> Thanks!
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>
>
>
> --
> Harsh J
>

Re: MapReduce MIME Input type?

Posted by Harsh J <ha...@cloudera.com>.
Hey Devin,

Are you perhaps looking for http://james.apache.org/mime4j/? You may have
to adapt it for MR but I don't imagine that would be too difficult to do.

On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX <ds...@rdx.com> wrote:

> Hi,
>
> I am trying to puzzle this out, and am hoping for some insight - I have an
> IMAP inbox dump that I am analyzing - I need to track how many times a
> given item is referred to in the inbox, i.e. how many emails came in about
> that thing and over what time. I can load it into MapReduce as
> TextInputFormat and parse it properly, and have managed to crudely
> concatenate lines that represent an email together as my final output, so,
> basically, it is working now, but my program is seeing each line as an
> InputSplit, and so it is only working reliably with one InputFileSplit.
> If I had a bigger file, with multiple InputFileSplits presenting
> line-by-line InputSplits, I have no way to be sure that the lines that make
> one email will not end up in two different splits - does that make sense?
>
> Someone I work with suggested that I attempt to read each email as a
> record, since they have their MIME encoding intact in the text dump, rather
> than each line as a record.
>
> Does anyone know of a MIME MapReduce input type? I can't be sure this will
> help anyway, since the file is already text-encoded - I may have to get the
> email from the original inbox as individual messages somehow to utilize the
> MIME header information.
>
> Googling this has been challenging, mainly because the words you have to
> use are somewhat overloaded - but I am finding some good clown schools in
> my research...so, any help is appreciated.
>
> Thanks!
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>



-- 
Harsh J
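
For a sense of what a mapper would need from each email record, here is a small standard-library sketch. It is not mime4j and does not use its API; it is a simplified stand-in that pulls header fields such as Subject and Date out of one raw message. The class name, method name, and sample message are made up for illustration. It stops at the first blank line, joins folded continuation lines, and skips lines that are not header fields, such as an mbox separator line.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a real MIME parser: header block only,
// no decoding of encoded words, no multipart handling.
final class HeaderSketch {
    /** Maps lower-cased header names to values; a repeated header keeps its last value. */
    static Map<String, String> parseHeaders(String rawMessage) {
        Map<String, String> headers = new LinkedHashMap<>();
        String currentName = null;
        for (String line : rawMessage.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && currentName != null) {
                // Folded header: append to the value of the previous field.
                headers.put(currentName, headers.get(currentName) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            String name = colon > 0 ? line.substring(0, colon) : "";
            // Field names contain no whitespace; this also rejects an
            // mbox "From ..." separator line that has a time in it.
            if (name.isEmpty() || name.indexOf(' ') >= 0 || name.indexOf('\t') >= 0) {
                currentName = null;
                continue;
            }
            currentName = name.toLowerCase(Locale.ROOT);
            headers.put(currentName, line.substring(colon + 1).trim());
        }
        return headers;
    }
}
```

A mapper built along these lines could emit the item mentioned in the subject together with the date value, and leave the counting per item and time window to the reducer. A full parser such as mime4j would be the better choice once encoded subjects or multipart bodies matter.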
