Posted to mapreduce-user@hadoop.apache.org by Felix Chern <id...@gmail.com> on 2014/01/29 23:46:41 UTC

Capture Directory Context in Hadoop Mapper

Hi all,

I wrote a tutorial on how to receive path information in the Mapper class. It's useful in our Hadoop use case, where we need to apply different logic to different input source directories. Enjoy!

http://www.idryman.org/blog/2014/01/26/capture-directory-context-in-hadoop-mapper/
http://www.idryman.org/blog/2014/01/27/capture-path-info-in-hadoop-inputformat-class/

Felix

Re: Capture Directory Context in Hadoop Mapper

Posted by Felix Chern <id...@gmail.com>.
MultipleInputs is nice. Most of the time, I use it for reduce-side joins.
It's great; however, you need to specify a different Mapper class per input directory.
In our case, we let the Mapper itself capture the directory information, because these directories might contain
data spanning several months, and the file structures may differ a bit over time.
In the end, this is the solution I came up with, and it's fun to hack on lower-level APIs. :D
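
To illustrate the idea of one Mapper branching on its input directory, here is a minimal sketch in plain Java. It is not the code from the blog posts. The class name, the directory names ("2013-12", "2014-01"), and the per-directory parsing rules are all hypothetical. In a real job the path string would typically come from ((FileSplit) context.getInputSplit()).getPath() in the Mapper's setup method; here it is passed in as a plain string so the sketch runs without Hadoop.

```java
import java.net.URI;

public class DirectoryContext {

    /** Returns the name of the directory that directly contains the given file. */
    static String parentDirName(String filePath) {
        // getPath() drops the scheme and authority, e.g. "hdfs://namenode:8020".
        String path = URI.create(filePath).getPath();
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash <= 0) {
            return ""; // file sits at the root or has no directory part
        }
        String parent = path.substring(0, lastSlash);
        return parent.substring(parent.lastIndexOf('/') + 1);
    }

    /**
     * Hypothetical per-directory logic: older data is assumed to be
     * comma-separated, newer data tab-separated. Returns the first field.
     */
    static String firstField(String dirName, String line) {
        switch (dirName) {
            case "2013-12":
                return line.split(",")[0];
            case "2014-01":
                return line.split("\t")[0];
            default:
                return line; // unknown layout: pass the record through unchanged
        }
    }

    public static void main(String[] args) {
        // Assumed example path; in a Mapper this would be derived from the input split.
        String dir = parentDirName("hdfs://namenode:8020/logs/2014-01/part-00000");
        System.out.println(dir + " -> " + firstField(dir, "alice\t42"));
        // prints "2014-01 -> alice"
    }
}
```

Resolving the directory name once per split (rather than once per record) keeps the branching cheap, since every record in a split comes from the same file.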

Still, thanks for the suggestion!

Felix

On Jan 29, 2014, at 10:15 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi,
> 
> These posts are nicely written - thanks for sharing! Have you also
> taken a look at the MultipleInputs feature, which gives you a cleaner
> approach? http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
> 
> On Thu, Jan 30, 2014 at 4:16 AM, Felix Chern <id...@gmail.com> wrote:
>> Hi all,
>> 
>> I wrote a tutorial of how to receive path information in Mapper class. It's
>> useful in our hadoop use case where we need to apply different logic on
>> different input source directory. Enjoy!
>> 
>> http://www.idryman.org/blog/2014/01/26/capture-directory-context-in-hadoop-mapper/
>> http://www.idryman.org/blog/2014/01/27/capture-path-info-in-hadoop-inputformat-class/
>> 
>> Felix
> 
> 
> 
> -- 
> Harsh J


Re: Capture Directory Context in Hadoop Mapper

Posted by Harsh J <ha...@cloudera.com>.
Hi,

These posts are nicely written - thanks for sharing! Have you also
taken a look at the MultipleInputs feature, which gives you a cleaner
approach? http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html

On Thu, Jan 30, 2014 at 4:16 AM, Felix Chern <id...@gmail.com> wrote:
> Hi all,
>
> I wrote a tutorial of how to receive path information in Mapper class. It's
> useful in our hadoop use case where we need to apply different logic on
> different input source directory. Enjoy!
>
> http://www.idryman.org/blog/2014/01/26/capture-directory-context-in-hadoop-mapper/
> http://www.idryman.org/blog/2014/01/27/capture-path-info-in-hadoop-inputformat-class/
>
> Felix



-- 
Harsh J
