Posted to general@hadoop.apache.org by Xueling Shu <xs...@systemsbiology.org> on 2009/12/12 04:19:14 UTC

Which Hadoop product is more appropriate for a quick query on a large data set?

 Hi there:

I am researching Hadoop to see which of its products suits our need for
quick queries against large data sets (billions of records per set).

The queries will be performed against chip sequencing data. Each record is
one line in a file. To be clear, a sample record from the data set is shown
below.

One line (record) looks like: 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0
1 4 *103570835* F .. 23G 24C

The highlighted field is called "position of match", and the query we are
interested in is the number of sequences whose "position of match" falls in
a certain range. For instance, the range can be "position of match" > 200
and "position of match" + 36 < 200,000.

Any suggestions on the Hadoop product I should start with to accomplish the
task? HBase, Pig, Hive, or ...?

Thanks!

Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by stack <st...@duboce.net>.
You might also consider HBase, particularly if you find that your data is
being updated with some regularity, especially if the updates are randomly
distributed over the data set. See
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
for how to do a fast bulk load of your billions of rows of data.
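
As a rough sketch, a bulk-load job could look something like the following.
This assumes an HBase release where HFileOutputFormat.configureIncrementalLoad
and LoadIncrementalHFiles are available (the bulk-load API has moved around
between versions); the table name "reads", column family "d", and the
line-parsing details are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadReads {

  // Turns each input line into a Put keyed by position-of-match
  // (assumed here to be the 8th whitespace-delimited field).
  static class ReadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws java.io.IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      byte[] rowKey = Bytes.toBytes(String.format("%012d", Long.parseLong(fields[7])));
      Put put = new Put(rowKey);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("line"), Bytes.toBytes(line.toString()));
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-reads");
    job.setJarByClass(BulkLoadReads.class);
    job.setMapperClass(ReadMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Wires in total-order partitioning and HFile output for the target table.
    HTable table = new HTable(conf, "reads");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the generated HFiles into the live table.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}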

Yours,
St.Ack


Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Fred Zappert <fz...@gmail.com>.
+1 for HBase

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Great information! Thank you for your help, Todd.

Xueling

RE: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>.
Isn't this what Hadoop HBase is supposed to do? The partitioning M/R implementation - "sharding" in street terms - is the sideways scaling that HBase is designed to excel at! Also, the indexed HBase flavour could allow very fast ad-hoc queries for Xueling using the new yet familiar HBQL SQL dialect?

Sorry, my 10 pence worth!

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Thanks for the information! I will start to try.

Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xueling,

Here's a general outline:

My guess is that your "position of match" field is bounded (perhaps by the
number of base pairs in the human genome?). Given this, you can probably
write a very simple Partitioner implementation that divides this field into
ranges, each with an approximately equal number of records.
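
A rough sketch of such a partitioner (the MAX_POSITION bound and the class
names here are just placeholders; if positions are skewed rather than roughly
uniform, you would sample the data to pick split points instead of using
equal-width ranges):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Splits the bounded position space into numPartitions equal-width ranges,
// so each reducer receives one contiguous range of positions.
public class PositionRangePartitioner extends Partitioner<LongWritable, Text> {

  // Hypothetical upper bound on "position of match" (roughly the length of
  // the human genome); tune this to your data.
  private static final long MAX_POSITION = 3200000000L;

  @Override
  public int getPartition(LongWritable position, Text record, int numPartitions) {
    long bucketWidth = MAX_POSITION / numPartitions + 1;
    int bucket = (int) (position.get() / bucketWidth);
    return Math.min(bucket, numPartitions - 1);
  }
}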

Next, write a simple MR job which takes in a line of data, and outputs the
same line, but with the position-of-match as the key. This will get
partitioned by the above function, so you end up with each reducer receiving
all of the records in a given range.
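
A sketch of that mapper, assuming the position-of-match is the 8th
whitespace-separated field (the asterisks around it in the sample record are
just highlighting, so the real field is a plain integer):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Re-keys each record by its position-of-match so that, combined with the
// range partitioner above, the shuffle delivers each reducer a sorted,
// contiguous slice of positions.
public class PositionKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private final LongWritable position = new LongWritable();

  @Override
  protected void map(LongWritable fileOffset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+");
    position.set(Long.parseLong(fields[7]));  // field 8: position of match
    context.write(position, line);
  }
}

The driver would then wire the two together with
job.setPartitionerClass(PositionRangePartitioner.class).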

In the reducer, simply output every 1000th position into your "sparse"
output file (along with the non-sparse output file offset), and every
position into the non-sparse output file.
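
A hedged sketch of that reducer, assuming the sparse entries go to a named
output via MultipleOutputs, the main output uses TextOutputFormat, and the
non-sparse file offset is tracked by counting the bytes this reducer writes
(each reducer produces one output file, so offsets are per-file):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes every record to the main (non-sparse) output, and every 1000th
// record's (position, byte offset) pair to a "sparse" named output.
public class SparseIndexReducer
    extends Reducer<LongWritable, Text, Text, NullWritable> {

  private MultipleOutputs<Text, NullWritable> outputs;
  private long bytesWritten = 0;    // offset into this reducer's non-sparse file
  private long recordsWritten = 0;

  @Override
  protected void setup(Context context) {
    outputs = new MultipleOutputs<Text, NullWritable>(context);
  }

  @Override
  protected void reduce(LongWritable position, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    for (Text record : records) {
      if (recordsWritten % 1000 == 0) {
        // Sparse index entry: position <TAB> byte offset of this record.
        outputs.write("sparse",
            new Text(position.get() + "\t" + bytesWritten), NullWritable.get());
      }
      context.write(record, NullWritable.get());
      bytesWritten += record.getLength() + 1;  // TextOutputFormat appends a newline
      recordsWritten++;
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    outputs.close();
  }
}

The driver registers the named output with
MultipleOutputs.addNamedOutput(job, "sparse", TextOutputFormat.class,
Text.class, NullWritable.class).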

In your realtime query server (not part of Hadoop), load the "sparse" file
into RAM and perform a binary search to find the "bins" in which the range
endpoints land, and then open the non-sparse output on HDFS to finish the
count.
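
For instance, if the sparse positions are kept in a sorted long[] in memory,
finding the bin that contains a range endpoint is just (a sketch):

import java.util.Arrays;

// sparsePositions[i] is the first position of the i-th bin, sorted ascending;
// bin i covers positions from sparsePositions[i] up to sparsePositions[i+1].
static int binFor(long[] sparsePositions, long endpoint) {
  int i = Arrays.binarySearch(sparsePositions, endpoint);
  // An exact hit starts its own bin; otherwise binarySearch returns
  // -(insertionPoint) - 1, and the endpoint lies in the bin that starts
  // just before the insertion point.
  return (i >= 0) ? i : Math.max(0, -i - 2);
}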

Hope that helps.

Thanks
-Todd

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
To rephrase the sentence "Or what APIs I should start with for my testing?": I
mean "Which HDFS APIs should I start to look into for my testing?"

Thanks,
Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Hi Todd:

After finishing some tasks, I have finally gotten back to HDFS testing.

One question for your last reply to this thread: Are there any code examples
close to your second and third recommendations? Or what APIs I should start
with for my testing?

Thanks.
Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xueling,

In that case, I would recommend the following:

1) Put all of your data on HDFS
2) Write a MapReduce job that sorts the data by position of match
3) As a second output of this job, you can write a "sparse index" -
basically a set of entries like this:

<position of match> <offset into file> <number of entries following>

where you're basically giving offsets into the sorted file every 10K records
or so. If you index every 10K records, then 5 billion records total will mean
about 500,000 index entries. Each index entry shouldn't be more than 20 bytes,
so 500,000 entries will be about 10MB. This is super easy to fit into memory.
(You could probably index every 100th record instead and end up with around
1GB, which could still fit in memory on a decent server.)

Then to satisfy your count-range query, you can simply scan your
in-memory sparse index. Some of the indexed blocks will be completely
included in the range, in which case you just add up the "number of
entries following" column. The start and finish blocks will be
partially covered, so you can use the file offset info to open that
file on HDFS, start reading at that offset, and finish the count.
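
A rough sketch of that query path, assuming the sorted data and the sparse
index live as plain text files on HDFS, the index fits in memory, each index
line looks like "position offset count", and the position-of-match is again
the 8th whitespace-separated field of each data record (paths and class names
here are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Answers "how many records have a position of match in [lo, hi]" using the
// in-memory sparse index plus partial scans of the sorted data file on HDFS.
public class RangeCounter {

  static class IndexEntry {
    final long position;  // first position covered by this block
    final long offset;    // byte offset of the block in the data file
    final long count;     // number of records in the block
    IndexEntry(long position, long offset, long count) {
      this.position = position; this.offset = offset; this.count = count;
    }
  }

  private final FileSystem fs;
  private final Path dataFile;
  private final List<IndexEntry> index = new ArrayList<IndexEntry>();

  public RangeCounter(Configuration conf, Path indexFile, Path dataFile) throws IOException {
    this.fs = FileSystem.get(conf);
    this.dataFile = dataFile;
    // Load the whole sparse index into memory (it is only a few MB).
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(indexFile)));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.trim().split("\\s+");
      index.add(new IndexEntry(Long.parseLong(f[0]), Long.parseLong(f[1]), Long.parseLong(f[2])));
    }
    in.close();
  }

  public long countInRange(long lo, long hi) throws IOException {
    long total = 0;
    for (int i = 0; i < index.size(); i++) {
      IndexEntry block = index.get(i);
      long nextStart = (i + 1 < index.size()) ? index.get(i + 1).position : Long.MAX_VALUE;
      if (nextStart < lo || block.position > hi) {
        continue;                          // block lies entirely outside the range
      } else if (block.position >= lo && nextStart <= hi) {
        total += block.count;              // block lies entirely inside: use the stored count
      } else {
        total += scanBlock(block, lo, hi); // boundary block: scan its records
      }
    }
    return total;
  }

  // Seeks to the block's offset in the sorted data file and counts its records
  // that fall in [lo, hi], stopping early once positions pass the upper bound.
  private long scanBlock(IndexEntry block, long lo, long hi) throws IOException {
    FSDataInputStream raw = fs.open(dataFile);
    raw.seek(block.offset);
    BufferedReader reader = new BufferedReader(new InputStreamReader(raw));
    long matches = 0;
    String line;
    for (long n = 0; n < block.count && (line = reader.readLine()) != null; n++) {
      long position = Long.parseLong(line.trim().split("\\s+")[7]);
      if (position > hi) break;            // data is sorted: nothing later can match
      if (position >= lo) matches++;
    }
    reader.close();
    return matches;
  }
}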

Total time per query should be <100ms, no problem.

-Todd

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Hi Todd:

Thank you for your reply.

The datasets won't be updated often, but queries against a data set will be
frequent. The quicker the query, the better. For example, we have done
testing on a MySQL database (5 billion records randomly scattered into 24
tables), and the slowest query against the biggest table (400,000,000
records) takes around 12 minutes. So if any Hadoop product can speed up the
search, then that product is what we are looking for.

Cheers,
Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xueling,

One important question that can really change the answer:

How often does the dataset change? Can the changes be merged in bulk
every once in a while, or do you need to actually update them randomly
very often?

Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or 10ms?

Thanks
-Todd
