Posted to user@pig.apache.org by Mike Sukmanowsky <mi...@parsely.com> on 2013/12/12 23:06:25 UTC

Log File Versioning and Pig

We're weighing options for what I'm sure is a common problem -
changing schemas in our log data.

Specifically, we collect pixel data via nginx servers.  These pixels
currently have a fairly static list of parameters in the query string, but
we eventually plan to change this and support many different types of
parameters in the query string.

Our current logs have a fixed number of fields separated by a \u0001
delimiter.  So to support "dynamic fields" we have two options (a Pig Latin
sketch of both follows this list):

   1. Store data using a Java/Pig Map of key:chararray and val:chararray
   2. Stick with static fields, and version the log format so that we know
   exactly how many fields to expect and what the schema is for each line
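
A minimal sketch of both load styles, purely for illustration - the paths,
field names, and the assumption that option 1's lines are stored in Pig's
[key#value,key#value,...] map text format are all hypothetical:

    -- Option 1: each line is one map in Pig's text map format; a missing
    -- key simply yields null, so old scripts survive new params
    raw_map = LOAD '/logs/pixels_map' AS (params:map[chararray]);
    urls    = FOREACH raw_map GENERATE params#'url' AS url;

    -- Option 2: fixed \u0001-delimited fields plus a leading version
    -- number; each script declares the version it expects
    raw_v2  = LOAD '/logs/pixels_v2' USING PigStorage('\u0001')
              AS (version:int, ts:long, url:chararray, referrer:chararray);
    only_v2 = FILTER raw_v2 BY version == 2;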

*Option 1 Pros:*
No versioning needed.  If we add a new param, it's automatically picked up
in the map and is available for all scripts to use.  Old scripts don't have
to worry about new params being added.

*Option 1 Cons:*
Adds significantly to our file sizes.  Compression will help a lot here,
since the map keys are repeated string values that compress well, but when
logs are eventually decompressed for analysis they'll eat up significantly
more disk space.  Also, we're not sure about this, but dealing with a ton
of Map objects in Pig could be more inefficient and carry more overhead
than a bunch of plain chararrays/Strings.  Anyone know if this is true?

*Option 2 Pros:*
Smaller file size is the big one here, since we don't have to store field
names in our raw logs - only the values and probably a version number.

*Option 2 Cons:*
It becomes harder for scripts to work with different versions, and we need
to explicitly state somewhere which log file version each script depends on.
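
One way to soften this, sketched under the same hypothetical layout as
above: load every line with the newest (superset) schema, let PigStorage
fill the missing trailing fields of older, shorter lines with nulls, and
branch on the version field:

    -- v1 lines have fewer fields, so their trailing fields come back null
    SPLIT raw_v2 INTO v1 IF version == 1, v2 IF version == 2;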

I was hoping to get a few opinions on this - what are people doing to
solve this in the wild?

-- 
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: mike@parsely.com

Re: Log File Versioning and Pig

Posted by Mike Sukmanowsky <mi...@parsely.com>.
Thanks Pradeep - none of our logs currently use Protocol Buffers/Thrift/Avro,
and we were somewhat trying to stay away from them, but they may be a good
option.

Re: Log File Versioning and Pig

Posted by Pradeep Gollakota <pr...@gmail.com>.
It seems like what you're asking for is versioned schema management. Pig is
not designed for that; Pig is only a scripting language for manipulating
datasets.

I'd recommend you look into Thrift, Protocol Buffers, and Avro. They are
compact serialization libraries that do versioned schema management.
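
As a concrete illustration - a minimal sketch, assuming Pig 0.12's built-in
AvroStorage (earlier versions need the piggybank jar registered) and
hypothetical paths and field names - Avro embeds the writer's schema in
each file, so readers can resolve old and new versions automatically:

    -- the schema travels with the data; fields added later with default
    -- values are filled in automatically when old files are read
    logs = LOAD '/logs/pixels' USING AvroStorage();
    out  = FOREACH logs GENERATE url, referrer;
    STORE out INTO '/out/pixels_slim' USING AvroStorage();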

