Posted to common-user@hadoop.apache.org by John Hancock <jh...@gmail.com> on 2012/05/17 12:10:14 UTC

custom FileInputFormat class

All,

Can anyone on the list point me in the right direction as to how to write
my own FileInputFormat class?

Perhaps this is not even the way I should go, but my goal is to write a
MapReduce job that gets its input from a binary file of integers and longs.

-John

Re: custom FileInputFormat class

Posted by Harsh J <ha...@cloudera.com>.
Hello John,

I covered two resources you can use to read up on these custom
extensions previously at http://search-hadoop.com/m/98TH8MPsTK. Hope
this helps you get started. Let us know if you have specific
issues/questions once you do :)

On Thu, May 17, 2012 at 3:40 PM, John Hancock <jh...@gmail.com> wrote:
> All,
>
> Can anyone on the list point me in the right direction as to how to write
> my own FileInputFormat class?
>
> Perhaps this is not even the way I should go, but my goal is to write a
> MapReduce job that gets its input from a binary file of integers and longs.
>
> -John



-- 
Harsh J

Re: custom FileInputFormat class

Posted by John Hancock <jh...@gmail.com>.
Devaraj,

Thanks for the pointer.

I ended up extending FileInputFormat.

I made some notes about the program I wrote to use the custom
FileInputFormat here:

https://cakephp.rootser.com/posts/view/64

I think it may be because I'm using 1.0.1, but I did not need to write a
getSplits() method.  However, I did need to override isSplitable(), where
I just went with the default implementation.  Is the way that one makes
one's input splittable to assign a codec to the job configuration?

Also, I think that if I now take my FileInputFormat class
(RootserFileInputFormat in the page I link to above), change the
nextKeyValue() method to use an ObjectInputStream, give
RootserFileInputFormat a type parameter, and make the type of the object
nextKeyValue() reads out of the split match that parameter, I will have a
FileInputFormat that can read any kind of (serializable) object out of a
split.  While this is cool, I can't believe I am the first person to think
of something like this.
Do you know if there is already a way to do this using the Hadoop framework?
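
[Editor's note: a framework-free sketch of the generic reader described above,
parameterized on the record type and pulling serialized objects out of a
stream the way a type-parameterized nextKeyValue() would. Class and method
names here are hypothetical; note that Hadoop itself normally uses Writable
types and SequenceFiles for this rather than Java serialization.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: a reader parameterized on the record type, doing
// the per-record work a generic RecordReader's nextKeyValue() would do.
public class ObjectRecordReader<T extends Serializable> {
    private final ObjectInputStream in;

    public ObjectRecordReader(InputStream raw) throws IOException {
        this.in = new ObjectInputStream(raw);
    }

    @SuppressWarnings("unchecked")
    public T nextRecord() throws IOException {
        try {
            return (T) in.readObject();     // one serialized object per record
        } catch (EOFException eof) {
            return null;                    // end of the split
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Serialize two records into a buffer, then read them back.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject("alpha");
        out.writeObject("beta");
        out.close();

        ObjectRecordReader<String> reader =
            new ObjectRecordReader<>(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(reader.nextRecord() + ","
                + reader.nextRecord() + "," + reader.nextRecord());
    }
}
```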

Thanks for the pointer on how to get started.

On Thu, May 17, 2012 at 6:32 AM, Devaraj k <de...@huawei.com> wrote:

> Hi John,
>
>
> You can extend FileInputFormat (or implement InputFormat), and then you
> need to implement the methods below.
>
> 1. InputSplit[] getSplits(JobConf job, int numSplits): splits the input
> files logically for the job. If FileInputFormat.getSplits(JobConf job,
> int numSplits) suits your requirement, you can make use of it; otherwise
> you can implement it based on your need.
>
> 2. RecordReader<K,V> getRecordReader(InputSplit split, JobConf job,
> Reporter reporter): reads records from an input split.
>
>
> Thanks
> Devaraj
>
> ________________________________________
> From: John Hancock [jhancock1975@gmail.com]
> Sent: Thursday, May 17, 2012 3:40 PM
> To: common-user@hadoop.apache.org
> Subject: custom FileInputFormat class
>
> All,
>
> Can anyone on the list point me in the right direction as to how to write
> my own FileInputFormat class?
>
> Perhaps this is not even the way I should go, but my goal is to write a
> MapReduce job that gets its input from a binary file of integers and longs.
>
> -John
>

RE: custom FileInputFormat class

Posted by Devaraj k <de...@huawei.com>.
Hi John,

You can extend FileInputFormat (or implement InputFormat), and then you need to implement the methods below.

1. InputSplit[] getSplits(JobConf job, int numSplits): splits the input files logically for the job. If FileInputFormat.getSplits(JobConf job, int numSplits) suits your requirement, you can make use of it; otherwise you can implement it based on your need.

2. RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter): reads records from an input split.
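
[Editor's note: the RecordReader is where the binary parsing for John's file
of integers and longs would happen. Below is a framework-free sketch of just
that parsing logic, assuming each record is one big-endian int followed by
one big-endian long, as DataOutputStream would write them; the record layout
is an assumption and should be adjusted to match how the file was produced.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-record parsing a custom RecordReader would do.
// Assumed layout: each record is an int key followed by a long value.
public class IntLongRecordParser {

    public static final int RECORD_BYTES = 4 + 8; // int + long

    // In a real RecordReader this loop body would live in next()/nextKeyValue().
    public static List<long[]> readAll(DataInputStream in) throws IOException {
        List<long[]> records = new ArrayList<>();
        while (true) {
            int key;
            try {
                key = in.readInt();
            } catch (EOFException eof) {
                break;                       // clean end of the split
            }
            long value = in.readLong();      // a key without a value is a corrupt record
            records.add(new long[] { key, value });
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Write two records the way DataOutputStream would, then parse them back.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(1);
        out.writeLong(100L);
        out.writeInt(2);
        out.writeLong(200L);

        List<long[]> records =
            readAll(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(records.size() + " records, first value=" + records.get(0)[1]);
    }
}
```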


Thanks
Devaraj

________________________________________
From: John Hancock [jhancock1975@gmail.com]
Sent: Thursday, May 17, 2012 3:40 PM
To: common-user@hadoop.apache.org
Subject: custom FileInputFormat class

All,

Can anyone on the list point me in the right direction as to how to write
my own FileInputFormat class?

Perhaps this is not even the way I should go, but my goal is to write a
MapReduce job that gets its input from a binary file of integers and longs.

-John