Posted to mapreduce-user@hadoop.apache.org by Devin Suiter RDX <ds...@rdx.com> on 2014/01/02 13:27:13 UTC

Re: MapReduce MIME Input type?

Thanks Harsh! Looks like something that might be useful! I appreciate it!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Dec 31, 2013 at 1:08 AM, Harsh J <ha...@cloudera.com> wrote:

> Hey Devin,
>
> Are you perhaps looking for http://james.apache.org/mime4j/? You may have
> to adapt it for MR but I don't imagine that would be too difficult to do.
>
> On Mon, Dec 30, 2013 at 11:59 PM, Devin Suiter RDX <ds...@rdx.com> wrote:
>
>> Hi,
>>
>> I am trying to puzzle this out, and am hoping for some insight - I have
>> an IMAP inbox dump that I am analyzing - I need to track how many times a
>> given item is referred to in the inbox, i.e. how many emails came in about
>> that thing and over what time. I can load it into MapReduce as
>> TextInputFormat and parse it properly, and have managed to crudely
>> concatenate lines that represent an email together as my final output, so,
>> basically, it is working now, but my program is seeing each line as an
>> InputSplit, and so it is only working reliably with one InputFileSplit.
>> If I had a bigger file, with multiple InputFileSplits presenting
>> line-by-line InputSplits, I have no way to be sure that the lines that make
>> one email will not end up in two different splits - does that make sense?
>>
>> Someone I work with suggested that I attempt to read each email as a
>> record, since they have their MIME encoding intact in the text dump, rather
>> than each line as a record.
>>
>> Does anyone know of a MIME MapReduce input type? I can't be sure this
>> will help anyway, since the file is already text-encoded - I may have to
>> get the email from the original inbox as individual messages somehow to
>> utilize the MIME header information.
>>
>> Googling this has been challenging, mainly because the words you have to
>> use are somewhat overloaded - but I am finding some good clown schools in
>> my research...so, any help is appreciated.
>>
>> Thanks!
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>
>
>
> --
> Harsh J
>
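
For anyone finding this thread later: the core of the "one email per record" idea can be sketched in plain Java without any Hadoop classes. This is only a rough illustration, not a tested InputFormat. It assumes the dump separates messages with mbox-style lines beginning with "From ", which the thread does not confirm, and the names MessageGrouper, startsMessage and group are made up for the example. In a real job this logic would sit inside a custom RecordReader, and each grouped message could then be handed to a MIME parser such as mime4j.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: groups the lines of a mail dump into one record per message.
// ASSUMPTION: messages are separated by mbox-style lines starting with "From ";
// the thread does not say what the dump's real delimiter is.
class MessageGrouper {

    // Hypothetical delimiter test; swap in whatever marks a new message in your dump.
    static boolean startsMessage(String line) {
        return line.startsWith("From ");
    }

    // Returns one string per message, with that message's lines joined by '\n'.
    static List<String> group(Iterable<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (startsMessage(line)) {
                // A delimiter closes the message in progress and opens a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
            }
            // Lines seen before the first delimiter are skipped: in a split-aware
            // reader they would belong to a message that began in the previous split.
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "tail of a message that started in an earlier split",
            "From alice@example.com Mon Dec 30 23:59:00 2013",
            "Subject: widget 42",
            "",
            "Body line",
            "From bob@example.com Tue Dec 31 01:08:00 2013",
            "Subject: Re: widget 42");
        for (String message : group(dump)) {
            System.out.println("--- record ---");
            System.out.println(message);
        }
    }
}
```

The skipped leading lines are the other half of the split problem: a split-aware reader drops the partial record at the start of its split and keeps reading past the end of its split until the last message it started is complete, which is the same approach Hadoop's LineRecordReader takes for single lines.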