Posted to general@hadoop.apache.org by Xueling Shu <xs...@systemsbiology.org> on 2009/12/12 04:19:14 UTC

Which Hadoop product is more appropriate for a quick query on a large data set?

 Hi there:

I am researching Hadoop to see which of its products suits our need for
quick queries against large data sets (billions of records per set).

The queries will be performed against chip sequencing data. Each record is
one line in a file. To be clear, a sample record from the data set is shown
below.

One line (record) looks like: 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0
1 4 *103570835* F .. 23G 24C

The highlighted field is called "position of match", and the query we are
interested in is the number of sequences whose "position of match" falls in
a certain range. For instance, the range can be "position of match" > 200
and "position of match" + 36 < 200,000.

Any suggestions on the Hadoop product I should start with to accomplish the
task? HBase, Pig, Hive, or ...?

Thanks!

Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by stack <st...@duboce.net>.
You might also consider HBase, particularly if you find that your data is
being updated with some regularity, especially if the updates are randomly
distributed over the data set. See
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
for how to do a fast bulk load of your billions of rows of data.
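
As a rough sketch, a bulk-load job could look something like the following.
This assumes an HBase release where HFileOutputFormat.configureIncrementalLoad
and LoadIncrementalHFiles are available (the bulk-load API has moved around
between versions); the table name "reads", column family "d", and the
line-parsing details are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadReads {

  // Turns each input line into a Put keyed by position-of-match
  // (assumed here to be the 8th whitespace-delimited field).
  static class ReadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws java.io.IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      byte[] rowKey = Bytes.toBytes(String.format("%012d", Long.parseLong(fields[7])));
      Put put = new Put(rowKey);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("line"), Bytes.toBytes(line.toString()));
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-reads");
    job.setJarByClass(BulkLoadReads.class);
    job.setMapperClass(ReadMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Wires in total-order partitioning and HFile output for the target table.
    HTable table = new HTable(conf, "reads");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the generated HFiles into the live table.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}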

Yours,
St.Ack


Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Fred Zappert <fz...@gmail.com>.
+1 for HBase

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Great information! Thank you for your help, Todd.

Xueling

RE: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>.
Isn't this what Hadoop HBase is supposed to do? The partitioning M/R implementation - "sharding" in street terms - is the sideways scaling that HBase is designed to excel at! Also, the indexed HBase flavour could allow very fast ad-hoc queries for Xueling using the new yet familiar HBQL SQL dialect?

Sorry, my 10 pence worth!

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Thanks for the information! I will start to try.

Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xueling,

Here's a general outline:

My guess is that your "position of match" field is bounded (perhaps by the
number of base pairs in the human genome?). Given this, you can probably
write a very simple Partitioner implementation that divides this field into
ranges, each with an approximately equal number of records.
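
A rough sketch of such a partitioner (the MAX_POSITION bound and the class
names here are just placeholders; if positions are skewed rather than roughly
uniform, you would sample the data to pick split points instead of using
equal-width ranges):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Splits the bounded position space into numPartitions equal-width ranges,
// so each reducer receives one contiguous range of positions.
public class PositionRangePartitioner extends Partitioner<LongWritable, Text> {

  // Hypothetical upper bound on "position of match" (roughly the length of
  // the human genome); tune this to your data.
  private static final long MAX_POSITION = 3200000000L;

  @Override
  public int getPartition(LongWritable position, Text record, int numPartitions) {
    long bucketWidth = MAX_POSITION / numPartitions + 1;
    int bucket = (int) (position.get() / bucketWidth);
    return Math.min(bucket, numPartitions - 1);
  }
}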

Next, write a simple MR job which takes in a line of data, and outputs the
same line, but with the position-of-match as the key. This will get
partitioned by the above function, so you end up with each reducer receiving
all of the records in a given range.
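
A sketch of that mapper, assuming the position-of-match is the 8th
whitespace-separated field (the asterisks around it in the sample record are
just highlighting, so the real field is a plain integer):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Re-keys each record by its position-of-match so that, combined with the
// range partitioner above, the shuffle delivers each reducer a sorted,
// contiguous slice of positions.
public class PositionKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private final LongWritable position = new LongWritable();

  @Override
  protected void map(LongWritable fileOffset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().trim().split("\\s+");
    position.set(Long.parseLong(fields[7]));  // field 8: position of match
    context.write(position, line);
  }
}

The driver would then wire the two together with
job.setPartitionerClass(PositionRangePartitioner.class).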

In the reducer, simply output every 1000th position into your "sparse"
output file (along with the non-sparse output file offset), and every
position into the non-sparse output file.
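
A hedged sketch of that reducer, assuming the sparse entries go to a named
output via MultipleOutputs, the main output uses TextOutputFormat, and the
non-sparse file offset is tracked by counting the bytes this reducer writes
(each reducer produces one output file, so offsets are per-file):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes every record to the main (non-sparse) output, and every 1000th
// record's (position, byte offset) pair to a "sparse" named output.
public class SparseIndexReducer
    extends Reducer<LongWritable, Text, Text, NullWritable> {

  private MultipleOutputs<Text, NullWritable> outputs;
  private long bytesWritten = 0;    // offset into this reducer's non-sparse file
  private long recordsWritten = 0;

  @Override
  protected void setup(Context context) {
    outputs = new MultipleOutputs<Text, NullWritable>(context);
  }

  @Override
  protected void reduce(LongWritable position, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    for (Text record : records) {
      if (recordsWritten % 1000 == 0) {
        // Sparse index entry: position <TAB> byte offset of this record.
        outputs.write("sparse",
            new Text(position.get() + "\t" + bytesWritten), NullWritable.get());
      }
      context.write(record, NullWritable.get());
      bytesWritten += record.getLength() + 1;  // TextOutputFormat appends a newline
      recordsWritten++;
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    outputs.close();
  }
}

The driver registers the named output with
MultipleOutputs.addNamedOutput(job, "sparse", TextOutputFormat.class,
Text.class, NullWritable.class).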

In your realtime query server (not part of Hadoop), load the "sparse" file
into RAM and perform a binary search to find the "bins" in which the range
endpoints land, and then open the non-sparse output on HDFS to finish the
count.
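
For instance, if the sparse positions are kept in a sorted long[] in memory,
finding the bin that contains a range endpoint is just (a sketch):

import java.util.Arrays;

// sparsePositions[i] is the first position of the i-th bin, sorted ascending;
// bin i covers positions from sparsePositions[i] up to sparsePositions[i+1].
static int binFor(long[] sparsePositions, long endpoint) {
  int i = Arrays.binarySearch(sparsePositions, endpoint);
  // An exact hit starts its own bin; otherwise binarySearch returns
  // -(insertionPoint) - 1, and the endpoint lies in the bin that starts
  // just before the insertion point.
  return (i >= 0) ? i : Math.max(0, -i - 2);
}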

Hope that helps.

Thanks
-Todd

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
To rephrase the sentence "Or what APIs I should start with for my testing?": I
mean "Which HDFS APIs should I start to look into for my testing?"

Thanks,
Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Hi Todd:

After finishing some tasks, I have finally gotten back to HDFS testing.

One question for your last reply to this thread: Are there any code examples
close to your second and third recommendations? Or what APIs I should start
with for my testing?

Thanks.
Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xueling,

In that case, I would recommend the following:

1) Put all of your data on HDFS
2) Write a MapReduce job that sorts the data by position of match
3) As a second output of this job, you can write a "sparse index" -
basically a set of entries like this:

<position of match> <offset into file> <number of entries following>

where you're basically giving offsets into the sorted file every 10K records
or so. If you index every 10K records, then 5 billion records total will mean
about 500,000 index entries. Each index entry shouldn't be more than 20 bytes,
so 500,000 entries will be about 10MB. This is super easy to fit into memory.
(You could probably index every 100th record instead and end up with around
1GB, which could still fit in memory on a decent server.)

Then to satisfy your count-range query, you can simply scan your
in-memory sparse index. Some of the indexed blocks will be completely
included in the range, in which case you just add up the "number of
entries following" column. The start and finish blocks will be
partially covered, so you can use the file offset info to open that
file on HDFS, start reading at that offset, and finish the count.
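
A rough sketch of that query path, assuming the sorted data and the sparse
index live as plain text files on HDFS, the index fits in memory, each index
line looks like "position offset count", and the position-of-match is again
the 8th whitespace-separated field of each data record (paths and class names
here are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Answers "how many records have a position of match in [lo, hi]" using the
// in-memory sparse index plus partial scans of the sorted data file on HDFS.
public class RangeCounter {

  static class IndexEntry {
    final long position;  // first position covered by this block
    final long offset;    // byte offset of the block in the data file
    final long count;     // number of records in the block
    IndexEntry(long position, long offset, long count) {
      this.position = position; this.offset = offset; this.count = count;
    }
  }

  private final FileSystem fs;
  private final Path dataFile;
  private final List<IndexEntry> index = new ArrayList<IndexEntry>();

  public RangeCounter(Configuration conf, Path indexFile, Path dataFile) throws IOException {
    this.fs = FileSystem.get(conf);
    this.dataFile = dataFile;
    // Load the whole sparse index into memory (it is only a few MB).
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(indexFile)));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.trim().split("\\s+");
      index.add(new IndexEntry(Long.parseLong(f[0]), Long.parseLong(f[1]), Long.parseLong(f[2])));
    }
    in.close();
  }

  public long countInRange(long lo, long hi) throws IOException {
    long total = 0;
    for (int i = 0; i < index.size(); i++) {
      IndexEntry block = index.get(i);
      long nextStart = (i + 1 < index.size()) ? index.get(i + 1).position : Long.MAX_VALUE;
      if (nextStart < lo || block.position > hi) {
        continue;                          // block lies entirely outside the range
      } else if (block.position >= lo && nextStart <= hi) {
        total += block.count;              // block lies entirely inside: use the stored count
      } else {
        total += scanBlock(block, lo, hi); // boundary block: scan its records
      }
    }
    return total;
  }

  // Seeks to the block's offset in the sorted data file and counts its records
  // that fall in [lo, hi], stopping early once positions pass the upper bound.
  private long scanBlock(IndexEntry block, long lo, long hi) throws IOException {
    FSDataInputStream raw = fs.open(dataFile);
    raw.seek(block.offset);
    BufferedReader reader = new BufferedReader(new InputStreamReader(raw));
    long matches = 0;
    String line;
    for (long n = 0; n < block.count && (line = reader.readLine()) != null; n++) {
      long position = Long.parseLong(line.trim().split("\\s+")[7]);
      if (position > hi) break;            // data is sorted: nothing later can match
      if (position >= lo) matches++;
    }
    reader.close();
    return matches;
  }
}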

Total time per query should be <100ms, no problem.

-Todd

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Xueling Shu <xs...@systemsbiology.org>.
Hi Todd:

Thank you for your reply.

The datasets won't be updated often, but queries against a data set will be
frequent. The quicker the query, the better. For example, we have done
testing on a MySQL database (5 billion records randomly scattered into 24
tables), and the slowest query against the biggest table (400,000,000
records) takes around 12 minutes. So if any Hadoop product can speed up the
search, then that product is what we are looking for.

Cheers,
Xueling

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Xueling,

One important question that can really change the answer:

How often does the dataset change? Can the changes be merged in bulk
every once in a while, or do you need to actually update them randomly
very often?

Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or 10ms?

Thanks
-Todd
