Posted to common-user@hadoop.apache.org by Rong-en Fan <gr...@gmail.com> on 2008/03/20 10:06:45 UTC

MapFile and MapFileOutputFormat

Hi,

I have two questions regarding the mapfile in hadoop/hdfs. First, when using
MapFileOutputFormat as reducer's output, is there any way to change
the index interval (i.e., able to call setIndexInterval() on the
output MapFile)?
Second, is it possible to tell the position in the data file for a given
key, assuming the index interval is 1 and the number of keys is small?

Thanks,
Rong-En Fan

Re: MapFile and MapFileOutputFormat

Posted by Rong-en Fan <gr...@gmail.com>.

On Fri, Mar 21, 2008 at 12:42 AM, Doug Cutting <cu...@apache.org> wrote:
> Rong-en Fan wrote:
>  > I have two questions regarding the mapfile in hadoop/hdfs. First, when using
>  > MapFileOutputFormat as reducer's output, is there any way to change
>  > the index interval (i.e., able to call setIndexInterval() on the
>  > output MapFile)?
>
>  Not at present.  It would probably be good to change MapFile to get this
>  value from the Configuration.  A static method could be added,
>  MapFile#setIndexInterval(Configuration conf, int interval), that sets
>  "io.mapfile.index.interval", and the MapFile constructor could read this
>  property from the Configuration.  One could then use the static method
>  to set this on jobs.
>
>  If you need this, please file an issue in Jira.  If possible, include a
>  patch too.
>
>  http://wiki.apache.org/hadoop/HowToContribute

Thanks, I will consider this.

>  > Second, is it possible to tell the position in the data file for a given
>  > key, assuming the index interval is 1 and the number of keys is small?
>
>  One could read the "index" file explicitly.  It's just a SequenceFile,
>  listing keys and positions in the "data" file.  But why would you set
>  the index interval to 1?  And why do you need to know the position?

I want to move my computation to the datanode that holds my data.
Because launching a map-reduce job has some overhead, I want to
run a persistent daemon on each datanode to do the computation.
Any suggestions?

Regards,
Rong-En Fan

Re: MapFile and MapFileOutputFormat

Posted by Doug Cutting <cu...@apache.org>.

Rong-en Fan wrote:
> I have two questions regarding the mapfile in hadoop/hdfs. First, when using
> MapFileOutputFormat as reducer's output, is there any way to change
> the index interval (i.e., able to call setIndexInterval() on the
> output MapFile)?

Not at present.  It would probably be good to change MapFile to get this 
value from the Configuration.  A static method could be added, 
MapFile#setIndexInterval(Configuration conf, int interval), that sets 
"io.mapfile.index.interval", and the MapFile constructor could read this 
property from the Configuration.  One could then use the static method 
to set this on jobs.
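
A rough sketch of what that could look like follows. It is only an
illustration of the idea, not a patch: java.util.Properties stands in for
Hadoop's Configuration, the Writer class is a hypothetical stand-in for the
MapFile writer, and the default of 128 is an assumption made for this sketch.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's Configuration,
// and Writer stands in for the MapFile writer.
class IndexIntervalSketch {
    // Property name as proposed in the message above.
    static final String INDEX_INTERVAL_KEY = "io.mapfile.index.interval";
    // Default used when the property is unset (assumed for this sketch).
    static final int DEFAULT_INDEX_INTERVAL = 128;

    // Static setter: records the desired interval in the job configuration.
    static void setIndexInterval(Properties conf, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        conf.setProperty(INDEX_INTERVAL_KEY, Integer.toString(interval));
    }

    // Reads the interval back, falling back to the default.
    static int getIndexInterval(Properties conf) {
        String value = conf.getProperty(
            INDEX_INTERVAL_KEY, Integer.toString(DEFAULT_INDEX_INTERVAL));
        return Integer.parseInt(value);
    }

    // The constructor picks up the interval from the configuration,
    // so an output format never has to call a setter on the writer.
    static final class Writer {
        final int indexInterval;

        Writer(Properties conf) {
            this.indexInterval = getIndexInterval(conf);
        }
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        setIndexInterval(jobConf, 1);          // set once on the job
        Writer writer = new Writer(jobConf);   // writer reads it on creation
        System.out.println("index interval = " + writer.indexInterval);
    }
}
```

The point of routing the value through the configuration is that the writer
is created inside the output format, where the job author has no handle on
it, while the configuration is something the job author already controls.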

If you need this, please file an issue in Jira.  If possible, include a 
patch too.

http://wiki.apache.org/hadoop/HowToContribute

> Second, is it possible to tell the position in the data file for a given
> key, assuming the index interval is 1 and the number of keys is small?

One could read the "index" file explicitly.  It's just a SequenceFile, 
listing keys and positions in the "data" file.  But why would you set 
the index interval to 1?  And why do you need to know the position?
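
To illustrate how the index maps keys to positions, here is a minimal
sketch. It does not read a real SequenceFile: the index is modelled as two
parallel arrays, and the keys and offsets are made-up values. With an index
interval of 1 every key has an entry, so an exact match gives the position
of that record; with a larger interval the result is only the position from
which to start scanning the "data" file.

```java
import java.util.Arrays;

// Hypothetical model of a MapFile "index": sorted keys with the position
// of each indexed record in the "data" file.
class IndexLookupSketch {
    // Returns the data-file position of the last indexed key <= key,
    // or -1 if key sorts before every indexed key.
    static long seekPosition(String[] indexKeys, long[] positions, String key) {
        int i = Arrays.binarySearch(indexKeys, key);
        if (i < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1.
            // The entry just before the insertion point is the floor entry.
            i = -(i + 1) - 1;
        }
        return i < 0 ? -1 : positions[i];
    }

    public static void main(String[] args) {
        // Made-up index contents, sorted by key.
        String[] keys = {"apple", "banana", "cherry"};
        long[] positions = {128L, 161L, 197L};

        System.out.println(seekPosition(keys, positions, "banana"));
        System.out.println(seekPosition(keys, positions, "blueberry"));
        System.out.println(seekPosition(keys, positions, "aardvark"));
    }
}
```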

Doug