Posted to common-user@hadoop.apache.org by Xavier Stevens <Xa...@fox.com> on 2008/03/03 23:23:53 UTC

What's the best way to get to a single key?

I am curious how others might be solving this problem.  I want to
retrieve a record from HDFS based on its key.  Are there any methods
that can shortcut this type of search to avoid parsing all data until
you find it?  Obviously HBase would do this as well, but I wanted to
know if there is a way to do it using just Map/Reduce and HDFS.

Thanks,

-Xavier


Re: What's the best way to get to a single key?

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
It should be possible to use the hash of a key to work out which shard it is
present in; you would then search over all entries in the relevant shard.

Miles
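
A minimal sketch of that idea in plain Java, assuming the job used hash
partitioning (the key's hash modulo the number of reducers), which is what
Hadoop's HashPartitioner is understood to do.  The class and method names
are hypothetical, and the shard count of 11 is only an example.

```java
import java.util.List;

public class ShardLookup {
    // Clear the sign bit so the result is never negative, then take the
    // remainder by the number of shards.
    static int shardFor(Object key, int numShards) {
        return (key.hashCode() & Integer.MAX_VALUE) % numShards;
    }

    public static void main(String[] args) {
        int numShards = 11; // example: one shard per reducer
        for (String key : List.of("mykey", "another-key")) {
            System.out.println(key + " -> shard " + shardFor(key, numShards));
        }
    }
}
```

Only the shard the key hashes to then needs to be searched, provided the
same partitioning function and shard count are used as when the data was
written.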




RE: What's the best way to get to a single key?

Posted by Xavier Stevens <Xa...@fox.com>.
Disregard.  I figured this one out.  It was an error caused by calling 

MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys,
outDir, defaults);

With the wrong path for outDir.
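
One plausible reading of the "/ by zero" in the stack trace, offered as an
assumption rather than something confirmed in this thread: with a wrong
outDir no MapFiles are found, the readers array is empty, and the hash
partitioner divides by the number of readers.  The sketch below reproduces
that arithmetic in plain Java; checkedPartitionFor is a hypothetical guard,
not part of Hadoop.

```java
public class EmptyReadersCheck {
    // Hash-partition arithmetic; throws ArithmeticException when
    // numPartitions is 0.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical guard: fail with a descriptive message instead.
    static int checkedPartitionFor(Object key, Object[] readers, String outDir) {
        if (readers.length == 0) {
            throw new IllegalStateException(
                "No MapFile readers found under " + outDir + "; check the path");
        }
        return partitionFor(key, readers.length);
    }

    public static void main(String[] args) {
        Object[] noReaders = new Object[0]; // stand-in for an empty MapFile.Reader[]
        try {
            partitionFor("mykey", noReaders.length);
        } catch (ArithmeticException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        try {
            checkedPartitionFor("mykey", noReaders, "/wrong/path");
        } catch (IllegalStateException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

Checking the length of the readers array right after opening them would
turn a confusing arithmetic error into a clear message about the path.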

Just in case anyone wants an example to do this later on.  I also had to
pass a non-null value to:

Text myEntry = new Text();
MapFileOutputFormat.getEntry(readers, part, new Text("mykey"), myEntry);

This method's Javadocs should be updated to make things a bit more
clear.  It both fills out the value object passed in as well as
returning it.  Or better yet change the method.  Unless I am missing
something I don't see why you should have to pass in a value at all,
since we really want to retrieve by key.

Cheers,

-Xavier
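
The "fills the value passed in and also returns it" contract described
above can be illustrated without Hadoop.  This is only a sketch: Holder is
a hypothetical stand-in for a mutable Writable such as Text, and getEntry
here is a toy method over a TreeMap, not the real
MapFileOutputFormat.getEntry.

```java
import java.util.Map;
import java.util.TreeMap;

public class FillAndReturn {
    // Mutable value holder, standing in for a Hadoop Writable such as Text.
    static final class Holder {
        String value = "";
        void set(String v) { value = v; }
        @Override public String toString() { return value; }
    }

    // Copies the stored value into 'into' and returns that same object,
    // or returns null when the key is absent.
    static Holder getEntry(Map<String, String> store, String key, Holder into) {
        String found = store.get(key);
        if (found == null) {
            return null;
        }
        into.set(found); // would throw NullPointerException if 'into' were null
        return into;
    }

    public static void main(String[] args) {
        Map<String, String> store = new TreeMap<>(Map.of("mykey", "myvalue"));
        Holder myEntry = new Holder();
        Holder returned = getEntry(store, "mykey", myEntry);
        System.out.println(myEntry);              // the caller's object was filled in place
        System.out.println(returned == myEntry);  // the same object is handed back
        System.out.println(getEntry(store, "absent", new Holder()));
    }
}
```

Passing in the value object lets a caller reuse one instance across many
lookups, which is one likely reason for the signature; a null return then
signals that the key was not found.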



RE: What's the best way to get to a single key?

Posted by Xavier Stevens <Xa...@fox.com>.
So I read some more through the Javadocs.  I had 11 reducers on my original job leaving me 11 MapFile directories.  I am passing in their parent directory here as "outDir".

MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fileSys, outDir, defaults);
Partitioner part = (Partitioner)ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
Text entryValue = (Text)MapFileOutputFormat.getEntry(readers, part, new Text("mykey"), null);
System.out.println("My Entry's Value: ");
System.out.println(entryValue.toString());

But I am getting an exception:

Exception in thread "main" java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:35)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:85)
        at mypackage.MyClass.main(ProfileReader.java:110)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

I am assuming I am doing something wrong, but I'm not sure what it is yet.  Any ideas?


-Xavier



RE: What's the best way to get to a single key?

Posted by Xavier Stevens <Xa...@fox.com>.
I was thinking a single file would be preferable because it would be
easier to search a single index.  Unless I don't have to worry and Hadoop
searches all my indexes at the same time.  Is this the case?

-Xavier
 


Re: What's the best way to get to a single key?

Posted by Doug Cutting <cu...@apache.org>.
Xavier Stevens wrote:
> Thanks for everything so far.  It has been really helpful.  I have one
> more question.  Is there a way to merge MapFile index/data files?

No.

To append text files you can use 'bin/hadoop fs -getmerge'.

To merge sorted SequenceFiles (like MapFile/index files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.hadoop.fs.Path,%20boolean)

But this doesn't generate a MapFile.

Why is a single file preferable?

Doug
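
For illustration, merging several sorted inputs into one sorted output is
a k-way merge.  The sketch below is plain Java over in-memory lists and is
not the implementation of SequenceFile.Sorter; the class name and sample
keys are made up.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedShardMerge {
    // One cursor per shard: the current key plus the rest of that shard.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;
        Cursor(Iterator<String> it) { rest = it; head = it.next(); }
    }

    // Merges already-sorted shards into a single sorted list.
    static List<String> merge(List<List<String>> shards) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> shard : shards) {
            if (!shard.isEmpty()) {
                heap.add(new Cursor(shard.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // shard with the smallest current key
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();  // advance that shard and re-queue it
                heap.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(
            List.of("apple", "melon"),
            List.of("banana", "zebra"),
            List.of("cherry"));
        System.out.println(merge(shards));
    }
}
```

With hash partitioning each shard is sorted internally but the key ranges
of different shards overlap, so the shards have to be merged rather than
simply concatenated.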

RE: What's the best way to get to a single key?

Posted by Xavier Stevens <Xa...@fox.com>.
Thanks for everything so far.  It has been really helpful.  I have one
more question.  Is there a way to merge MapFile index/data files?
Assuming there is, what is the best way to do so?  I was reading the
Javadocs on it and it looked like this is possible, but they weren't very
explicit.  Obviously I could specify a single reducer, but with
my data size that would be really slow.

Thanks,

-Xavier



Re: What's the best way to get to a single key?

Posted by Ted Dunning <td...@veoh.com>.
And this, btw, provides a rationale for having a key in the reducer output.


On 3/4/08 12:53 PM, "Doug Cutting" <cu...@apache.org> wrote:

> So you should be able to
> just switch from specifying SequenceFileOutputFormat to
> MapFileOutputFormat in your jobs and everything should work the same
> except you'll have index files that permit random access.


Re: What's the best way to get to a single key?

Posted by Doug Cutting <cu...@apache.org>.
Xavier Stevens wrote:
> Is there a way to do this when your input data is using SequenceFile
> compression?

Yes.  A MapFile is simply a directory containing two SequenceFiles named 
"data" and "index".  MapFileOutputFormat uses the same compression 
parameters as SequenceFileOutputFormat.  SequenceFileInputFormat 
recognizes MapFiles and reads the "data" file.  So you should be able to 
just switch from specifying SequenceFileOutputFormat to 
MapFileOutputFormat in your jobs and everything should work the same 
except you'll have index files that permit random access.

Doug
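
The directory layout described above can be sketched as follows.  This
uses java.nio paths on the local filesystem purely as a stand-in for HDFS
paths; the "part-00000" style naming of reducer outputs and the example
output directory are assumptions for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class MapFileLayout {
    // Assumed naming of reducer outputs: part-00000, part-00001, ...
    static Path partDir(Path outDir, int partition) {
        return outDir.resolve(String.format("part-%05d", partition));
    }

    // A MapFile is a directory holding two files named "data" and "index".
    static Path dataFile(Path mapFileDir)  { return mapFileDir.resolve("data"); }
    static Path indexFile(Path mapFileDir) { return mapFileDir.resolve("index"); }

    public static void main(String[] args) {
        Path outDir = Paths.get("/user/example/output"); // hypothetical path
        Path shard = partDir(outDir, 3);
        System.out.println(dataFile(shard));
        System.out.println(indexFile(shard));
    }
}
```

A job with 11 reducers would therefore leave 11 such directories under
the output path, each with its own "data" and "index" file.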

RE: What's the best way to get to a single key?

Posted by Xavier Stevens <Xa...@fox.com>.
Is there a way to do this when your input data is using SequenceFile
compression?

Thanks,

-Xavier 


Re: What's the best way to get to a single key?

Posted by Doug Cutting <cu...@apache.org>.
Use MapFileOutputFormat to write your data, then call:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V)

The documentation is pretty sparse, but the intent is that you open a 
MapFile.Reader for each mapreduce output, pass the partitioner used, the 
key, and the value to be read into.

A MapFile maintains an index of keys, so the entire file need not be 
scanned.  If you really only need the value of a single key then you 
might avoid opening all of the output files.  In that case you could 
use the Partitioner and the MapFile API directly.

Doug
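
To illustrate why an index of keys avoids scanning the whole file, here is
a plain-Java simulation of a sparse index: every Nth key is indexed, a
lookup binary-searches the index, then scans forward at most N entries.
This is a sketch of the general technique, not MapFile's actual code; the
class is hypothetical and the interval of 128 mirrors what is understood
to be MapFile's default index interval.

```java
import java.util.ArrayList;
import java.util.List;

public class SparseIndexLookup {
    private final List<String> keys;    // sorted keys, standing in for the "data" file
    private final List<String> values;  // values aligned with keys
    private final List<Integer> indexPos = new ArrayList<>(); // position of every Nth key
    private final int interval;
    int scanned; // number of data entries examined by the last get()

    SparseIndexLookup(List<String> sortedKeys, List<String> vals, int interval) {
        this.keys = sortedKeys;
        this.values = vals;
        this.interval = interval;
        for (int i = 0; i < sortedKeys.size(); i += interval) {
            indexPos.add(i);
        }
    }

    String get(String key) {
        scanned = 0;
        // Binary search the sparse index for the last indexed key <= key.
        int lo = 0, hi = indexPos.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys.get(indexPos.get(mid)).compareTo(key) <= 0) {
                start = indexPos.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // key sorts before the first entry
        }
        // Scan forward from that position; at most 'interval' entries.
        for (int i = start; i < keys.size() && i < start + interval; i++) {
            scanned++;
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) return values.get(i);
            if (cmp > 0) break;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        List<String> vals = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(String.format("key%04d", i));
            vals.add("value" + i);
        }
        SparseIndexLookup file = new SparseIndexLookup(keys, vals, 128);
        System.out.println(file.get("key0700") + " after scanning "
            + file.scanned + " of " + keys.size() + " entries");
    }
}
```

The cost of a lookup is bounded by the index interval rather than by the
size of the file, which is the property that makes random access by key
practical.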

