Posted to user@avro.apache.org by Ken Krugler <kk...@transpac.com> on 2014/12/04 03:06:02 UTC

Building skip table for Avro data

Hi all,

I'm looking for suggestions on how to optimize a number of Hadoop jobs (written using Cascading) that only need a fraction of the records stored in Avro files.

Essentially I have a small number (let's say 10K) of essentially random keys out of a total of 100M unique values, and I need to select & process all and only those records in my Avro files where the key field matches. The set of keys that are of interest changes with each run.

I have about 1TB of compressed data to scan through, saved as about 200 5GB files. This represents about 10B records.

The data format has to stay as Avro, for interchange with various groups.

As I'm building the Avro files, I could sort by the key field.

I'm wondering if it's feasible to build a skip table that would let me seek to a sync position in the Avro file and read from it. If the default sync interval is 16K, then I'd have 65M of these that I could use, and even if every key of interest had 100 records that were each in a separate block, this would still dramatically cut down on the amount of data I'd have to scan over.
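
To make that concrete, here's the kind of thing I'm picturing on the write side (a rough, untested sketch; the "key" field name, the 100,000-record interval, and the tab-separated side file are all just placeholders):

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SkipTableWriter {

  // Writes records (already sorted by their key field) and, every N records,
  // forces a block boundary and saves (first key of next block, offset) to a side file.
  public static void write(Schema schema, Iterable<GenericRecord> sortedRecords,
                           File avroFile, File skipTableFile) throws IOException {
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
         PrintWriter skipTable = new PrintWriter(skipTableFile, "UTF-8")) {
      writer.create(schema, avroFile);

      long count = 0;
      for (GenericRecord record : sortedRecords) {
        if (count % 100000 == 0) {
          // sync() ends the current block and returns a position that can later
          // be handed to DataFileReader.seek()/sync().
          long offset = writer.sync();
          skipTable.println(record.get("key") + "\t" + offset);  // "key" is a placeholder field name
        }
        writer.append(record);
        count++;
      }
    }
  }
}

At read time I'd load that side file into memory and use the saved offsets to seek before scanning.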

But is that possible? Any input would be appreciated.

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






RE: Building skip table for Avro data

Posted by "Peterson, Michael" <mi...@truvenhealth.com>.
Ken,

Did you get this working? If so, I'd love to get some pointers (or a blog post) on how you did it.  I'm interested in the same functionality in the near future, but don't have time right now to dive in and figure it out.

-Michael

From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
Sent: Thursday, December 04, 2014 11:05 PM
To: user@avro.apache.org
Subject: RE: Building skip table for Avro data

Hi Doug,

________________________________

From: Doug Cutting

Sent: December 4, 2014 7:41:05am PST

To: user@avro.apache.org

Subject: Re: Building skip table for Avro data


Have you looked at SortedKeyValueFile?

https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html

This may already provide what you need.

Seems pretty darn close.

So to use this in a regular Hadoop job, it looks like I'd essentially not emit anything via the reducer's standard context.write() call, but instead I'd need to set up a SortedKeyValueFile.Writer to write to the job's output directory (using a sub-directory Path based on the task number), yes?

Which means I'd wind up with <job output path>/part-xxxxx/data and <job output path>/part-xxxxx/index

However I'm a little confused about the /data file being "an ordinary Avro container file". The doc says:

     Each record has exactly two fields, 'key' and 'value'. The keys are sorted lexicographically

But both the key and value fields have their own Avro Schema, right? Or at least that's what I assume from the withValueSchema and withKeySchema calls for setting up the writer options.

So in that case:

a. how is it sorted lexicographically (as per the SortedKeyValueFile JavaDocs)?

b. How would a reader who's expecting a regular Avro file read the records? Would they get records that were the union of fields in the key + value schemas?

Thanks again,

-- Ken



On Dec 3, 2014 10:14 PM, "Joey Echeverria" <jo...@cloudera.com> wrote:
It sounds feasible to me. You can certainly seek to a specific sync
marker and so long as you're periodically calling sync to get the last
position, then you can save those offsets in a separate file(s) that
you load into memory or search sequentially.

This sounds very similar to MapFiles which used a pair of
SequenceFiles, one with the data and one with an index of every Nth
key to speed up lookups of sorted data.

-Joey

On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <kk...@transpac.com> wrote:
> Hi all,
>
> I'm looking for suggestions on how to optimize a number of Hadoop jobs
> (written using Cascading) that only need a fraction of the records stored in
> Avro files.
>
> Essentially I have a small number (let's say 10K) of essentially random keys
> out of a total of 100M unique values, and I need to select & process all and
> only those records in my Avro files where the key field matches. The set of
> keys that are of interest changes with each run.
>
> I have about 1TB of compressed data to scan through, saved as about 200 5GB
> files. This represents about 10B records.
>
> The data format has to stay as Avro, for interchange with various groups.
>
> As I'm building the Avro files, I could sort by the key field.
>
> I'm wondering if it's feasible to build a skip table that would let me seek
> to a sync position in the Avro file and read from it. If the default sync
> interval is 16K, then I'd have 65M of these that I could use, and even if
> every key of interest had 100 records that were each in a separate block,
> this would still dramatically cut down on the amount of data I'd have to
> scan over.
>
> But is that possible? Any input would be appreciated.
>
> Thanks,
>
> -- Ken
>


--
Joey Echeverria


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Building skip table for Avro data

Posted by Doug Cutting <cu...@apache.org>.
On Thu, Dec 4, 2014 at 8:05 PM, Ken Krugler <kk...@transpac.com> wrote:
> a. how is it sorted lexicographically (as per the SortedKeyValueFile
> JavaDocs)?

The key/value pairs are sorted by their key schema, as per Avro's
order specification:

http://avro.apache.org/docs/current/spec.html#order
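
For example, the same ordering is available programmatically (a small sketch, assuming a plain string key schema):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

public class KeyOrderDemo {
  public static void main(String[] args) {
    // Entries are compared against the key schema using Avro's per-type rules,
    // honoring any "order" attributes on record fields.
    Schema keySchema = Schema.create(Schema.Type.STRING);
    int cmp = GenericData.get().compare("apple", "banana", keySchema);
    System.out.println(cmp < 0 ? "apple sorts before banana" : "unexpected order");
  }
}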

> b. How would a reader who's expecting a regular Avro file read the records?
> Would they get records that were the union of fields in the key + value
> schemas?

Looking at the source, it looks like the key/value pairs are stored as
a record schema with fields named "key" and "value".  The name of the
record is org.apache.avro.mapreduce.KeyValuePair.
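
So a reader that opens part-xxxxx/data as an ordinary container file should just see those two-field records, e.g. (untested sketch):

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class PlainReaderDemo {

  // Opens the data file with no knowledge of SortedKeyValueFile; each record is
  // a KeyValuePair with exactly two fields, "key" and "value".
  public static void dump(File dataFile) throws IOException {
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(dataFile, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord pair : reader) {
        System.out.println(pair.get("key") + " -> " + pair.get("value"));
      }
    }
  }
}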

Doug

RE: Building skip table for Avro data

Posted by Ken Krugler <kk...@transpac.com>.
Hi Doug,

> From: Doug Cutting
> Sent: December 4, 2014 7:41:05am PST
> To: user@avro.apache.org
> Subject: Re: Building skip table for Avro data
> 
> Have you looked at SortedKeyValueFile?
> 
> https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html
> 
> This may already provide what you need.
> 
Seems pretty darn close.

So to use this in a regular Hadoop job, it looks like I'd essentially not emit anything via the reducer's standard context.write() call, but instead I'd need to set up a SortedKeyValueFile.Writer to write to the job's output directory (using a sub-directory Path based on the task number), yes?
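
Roughly this is what I have in mind for the reducer side (untested sketch; the method shape, the task-number formatting, and the generic types are just placeholders):

import java.io.IOException;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.hadoop.file.SortedKeyValueFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ReducerSideWriter {

  // Instead of context.write(), open a SortedKeyValueFile.Writer on a per-task
  // sub-directory under the job's output path and append the sorted pairs to it.
  public static void writePartition(Configuration conf, Path jobOutput, int taskId,
                                    Schema keySchema, Schema valueSchema,
                                    Iterable<Map.Entry<CharSequence, Object>> sortedPairs)
      throws IOException {
    // e.g. <job output path>/part-00003 -- the writer creates data and index inside it
    Path partDir = new Path(jobOutput, String.format("part-%05d", taskId));

    SortedKeyValueFile.Writer.Options options = new SortedKeyValueFile.Writer.Options()
        .withConfiguration(conf)
        .withKeySchema(keySchema)
        .withValueSchema(valueSchema)
        .withPath(partDir);

    try (SortedKeyValueFile.Writer<CharSequence, Object> writer =
             new SortedKeyValueFile.Writer<>(options)) {
      for (Map.Entry<CharSequence, Object> pair : sortedPairs) {
        // append() requires keys to arrive in non-decreasing key-schema order
        writer.append(pair.getKey(), pair.getValue());
      }
    }
  }
}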

Which means I'd wind up with <job output path>/part-xxxxx/data and <job output path>/part-xxxxx/index

However I'm a little confused about the /data file being "an ordinary Avro container file". The doc says:

     Each record has exactly two fields, 'key' and 'value'. The keys are sorted lexicographically

But both the key and value fields have their own Avro Schema, right? Or at least that's what I assume from the withValueSchema and withKeySchema calls for setting up the writer options.

So in that case:

a. how is it sorted lexicographically (as per the SortedKeyValueFile JavaDocs)?

b. How would a reader who's expecting a regular Avro file read the records? Would they get records that were the union of fields in the key + value schemas?

Thanks again,

-- Ken

> On Dec 3, 2014 10:14 PM, "Joey Echeverria" <jo...@cloudera.com> wrote:
> 
> It sounds feasible to me. You can certainly seek to a specific sync
> marker and so long as you're periodically calling sync to get the last
> position, then you can save those offsets in a separate file(s) that
> you load into memory or search sequentially.
> 
> This sounds very similar to MapFiles which used a pair of
> SequenceFiles, one with the data and one with an index of every Nth
> key to speed up lookups of sorted data.
> 
> -Joey
> 
> On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <kk...@transpac.com> wrote:
> > Hi all,
> >
> > I'm looking for suggestions on how to optimize a number of Hadoop jobs
> > (written using Cascading) that only need a fraction of the records stored in
> > Avro files.
> >
> > Essentially I have a small number (let's say 10K) of essentially random keys
> > out of a total of 100M unique values, and I need to select & process all and
> > only those records in my Avro files where the key field matches. The set of
> > keys that are of interest changes with each run.
> >
> > I have about 1TB of compressed data to scan through, saved as about 200 5GB
> > files. This represents about 10B records.
> >
> > The data format has to stay as Avro, for interchange with various groups.
> >
> > As I'm building the Avro files, I could sort by the key field.
> >
> > I'm wondering if it's feasible to build a skip table that would let me seek
> > to a sync position in the Avro file and read from it. If the default sync
> > interval is 16K, then I'd have 65M of these that I could use, and even if
> > every key of interest had 100 records that were each in a separate block,
> > this would still dramatically cut down on the amount of data I'd have to
> > scan over.
> >
> > But is that possible? Any input would be appreciated.
> >
> > Thanks,
> >
> > -- Ken
> >
> 
> 
> --
> Joey Echeverria


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Building skip table for Avro data

Posted by Doug Cutting <cu...@apache.org>.
Have you looked at SortedKeyValueFile?

https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html

This may already provide what you need.
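
Lookups would then go through SortedKeyValueFile.Reader, roughly like this (a sketch, not tested against your schemas; types and names are placeholders):

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.hadoop.file.SortedKeyValueFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class KeyLookup {

  // Point lookup against one part directory; returns null if the key is absent,
  // so only records for the keys of interest get touched.
  public static Object lookup(Configuration conf, Path partDir,
                              Schema keySchema, Schema valueSchema,
                              CharSequence key) throws IOException {
    SortedKeyValueFile.Reader.Options options = new SortedKeyValueFile.Reader.Options()
        .withConfiguration(conf)
        .withKeySchema(keySchema)
        .withValueSchema(valueSchema)
        .withPath(partDir);

    try (SortedKeyValueFile.Reader<CharSequence, Object> reader =
             new SortedKeyValueFile.Reader<>(options)) {
      // the small index file is held in memory; get() seeks close to the key
      // in the data file and scans from there
      return reader.get(key);
    }
  }
}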

Doug
On Dec 3, 2014 10:14 PM, "Joey Echeverria" <jo...@cloudera.com> wrote:

> It sounds feasible to me. You can certainly seek to a specific sync
> marker and so long as you're periodically calling sync to get the last
> position, then you can save those offsets in a separate file(s) that
> you load into memory or search sequentially.
>
> This sounds very similar to MapFiles which used a pair of
> SequenceFiles, one with the data and one with an index of every Nth
> key to speed up lookups of sorted data.
>
> -Joey
>
> On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <kk...@transpac.com> wrote:
> > Hi all,
> >
> > I'm looking for suggestions on how to optimize a number of Hadoop jobs
> > (written using Cascading) that only need a fraction of the records stored
> > in Avro files.
> >
> > Essentially I have a small number (let's say 10K) of essentially random
> > keys out of a total of 100M unique values, and I need to select & process
> > all and only those records in my Avro files where the key field matches.
> > The set of keys that are of interest changes with each run.
> >
> > I have about 1TB of compressed data to scan through, saved as about 200
> > 5GB files. This represents about 10B records.
> >
> > The data format has to stay as Avro, for interchange with various groups.
> >
> > As I'm building the Avro files, I could sort by the key field.
> >
> > I'm wondering if it's feasible to build a skip table that would let me
> > seek to a sync position in the Avro file and read from it. If the default
> > sync interval is 16K, then I'd have 65M of these that I could use, and
> > even if every key of interest had 100 records that were each in a separate
> > block, this would still dramatically cut down on the amount of data I'd
> > have to scan over.
> >
> > But is that possible? Any input would be appreciated.
> >
> > Thanks,
> >
> > -- Ken
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
> >
> >
> >
> >
>
>
>
> --
> Joey Echeverria
>

Re: Building skip table for Avro data

Posted by Joey Echeverria <jo...@cloudera.com>.
It sounds feasible to me. You can certainly seek to a specific sync
marker and so long as you're periodically calling sync to get the last
position, then you can save those offsets in a separate file(s) that
you load into memory or search sequentially.
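
Something like this on the read side (untested sketch; assumes the file is sorted by a "key" field and that the offsets were saved earlier via DataFileWriter.sync()):

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SkipTableLookup {

  // Jump to a previously saved sync position and scan just that region for the
  // wanted key, instead of reading the whole file.
  public static void scanFrom(File avroFile, long savedOffset, String wantedKey)
      throws IOException {
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
      // savedOffset is a value that was returned by DataFileWriter.sync() at write time
      reader.seek(savedOffset);
      while (reader.hasNext()) {
        GenericRecord record = reader.next();
        int cmp = record.get("key").toString().compareTo(wantedKey);  // "key" is a placeholder field name
        if (cmp > 0) {
          break;  // the file is sorted by key, so nothing later can match
        }
        if (cmp == 0) {
          System.out.println(record);  // process the matching record
        }
      }
    }
  }
}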

This sounds very similar to MapFiles which used a pair of
SequenceFiles, one with the data and one with an index of every Nth
key to speed up lookups of sorted data.

-Joey

On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <kk...@transpac.com> wrote:
> Hi all,
>
> I'm looking for suggestions on how to optimize a number of Hadoop jobs
> (written using Cascading) that only need a fraction of the records stored in
> Avro files.
>
> Essentially I have a small number (let's say 10K) of essentially random keys
> out of a total of 100M unique values, and I need to select & process all and
> only those records in my Avro files where the key field matches. The set of
> keys that are of interest changes with each run.
>
> I have about 1TB of compressed data to scan through, saved as about 200 5GB
> files. This represents about 10B records.
>
> The data format has to stay as Avro, for interchange with various groups.
>
> As I'm building the Avro files, I could sort by the key field.
>
> I'm wondering if it's feasible to build a skip table that would let me seek
> to a sync position in the Avro file and read from it. If the default sync
> interval is 16K, then I'd have 65M of these that I could use, and even if
> every key of interest had 100 records that were each in a separate block,
> this would still dramatically cut down on the amount of data I'd have to
> scan over.
>
> But is that possible? Any input would be appreciated.
>
> Thanks,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>



-- 
Joey Echeverria