Posted to user@pig.apache.org by shan s <my...@gmail.com> on 2012/03/21 19:49:44 UTC

how to best process key-value pairs with Pig

In our relational database we have a large amount of key-value data in two
tables. Let's call them Entity and EntityAttribute.



Table: Entity             Columns: Entity ID, Entity Type
Table: EntityAttribute    Columns: EntityID, PropertyName, PropertyValue



These entities are loosely related to each other, hence live under a single
roof.

There are approximately 100 attributes across all entities and 20 different
entity types.



My questions are:

- What is the best way to represent this kind of key-value pair data for
processing with Pig?

- Should I represent it as key=value pairs in text files? If so, how would I
process such data in Pig?

- Any pointers to UDFs that help with key-value pairs would be great.



Many Thanks,

Shan

Re: how to best process key-value pairs with Pig

Posted by Bill Graham <bi...@gmail.com>.
If you need one row per entity, you could store the data using Avro or JSON.
Both would let you associate a map of key/values with your entity.
AvroStorage in Piggybank and JsonLoader in Pig would help if you were to
store the entire row as Avro or JSON. If you just want to store a single
field as a serialized object, you could write a UDF to do that.
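As a rough sketch of that representation (the entity type and property names
here are made up, not from the thread), an entity row serialized as one JSON
document per line, with its attributes collected into a map, might look like:

```python
import json

# Hypothetical entity row: one record per entity, with its attributes
# gathered into a single "properties" map instead of N separate rows.
entity = {
    "entity_id": "8a9e202b-4da6-4cc0-958b-0000bd4c2c9d",
    "entity_type": "customer",   # hypothetical entity type
    "properties": {              # map of PropertyName -> PropertyValue
        "prop1": "xyz",
        "prop4": "20120312 04:38:02.140",
    },
}

line = json.dumps(entity, sort_keys=True)  # one JSON document per line
row = json.loads(line)                     # what a JSON loader hands back
print(row["properties"]["prop1"])
```

A loader that maps the properties field to a Pig map would then let you reach
individual attributes with the map-dereference operator, e.g.
properties#'prop1', in a FILTER or FOREACH.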

On Wed, Mar 21, 2012 at 10:32 PM, shan s <my...@gmail.com> wrote:



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: how to best process key-value pairs with Pig

Posted by shan s <my...@gmail.com>.
The numbers 100 and 20 are metadata counts; the data instances themselves
are large. Moreover, given the denormalized form, the data can't take
advantage of indexes.

The data is currently denormalized, in the sense that instead of having 100
sparse columns, it is stored as key-value pairs in a 3-column table: one row
for every attribute of an entity, resulting in N rows per entity, where N is
the number of attributes the entity has.

I see two options for converting this data to text files to yield one row
for each entity:
1. Use sparse columns: add a column for each possible property/attribute of
an entity. This means adding new columns to the file at ETL time and
maintaining the schema.
2. Translate to key=value pairs, and handle the parsing complexity in the
Pig scripts.

For option 2, are there any tools or UDFs that make parsing/processing of
key-value pairs easier?
An example of a converted line is:
8a9e202b-4da6-4cc0-958b-0000bd4c2c9d,prop1=xyz,prop2=9cd72489-6c03-489a-92cd-c9f938a7b223,prop3=20120312 04:38:02.140,prop4=20120312 04:38:02.140,prop5=e689968f-2c64-457b-a0ba-5f0122687172,prop6=5ce12c5b-2c82-4fbe-961e-fd04de96a8ae

In other words, I need to query this data with predicates like
WHERE prop4 > now()
WHERE prop2 = '9cd72489-6c03-489a-92cd-c9f938a7b223'

Do I need to write UDFs, or are there pre-existing tools that I can use to
do this?
Thanks!
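If option 2 is chosen, the UDF is mostly string splitting. Here is a minimal
sketch of that logic in Python (which Pig can also run as a Jython UDF; the
function name is my own invention), assuming the first comma-separated field
is the entity ID, commas separate the pairs, and values contain no commas:

```python
def parse_kv_line(line):
    """Split 'id,k1=v1,k2=v2,...' into (id, dict of properties).

    Assumes commas separate pairs and '=' separates key from value;
    values themselves must not contain commas.
    """
    fields = line.strip().split(",")
    entity_id, pairs = fields[0], fields[1:]
    props = {}
    for pair in pairs:
        key, _, value = pair.partition("=")  # split on first '=' only
        props[key] = value
    return entity_id, props

entity_id, props = parse_kv_line(
    "8a9e202b-4da6-4cc0-958b-0000bd4c2c9d,prop1=xyz,prop3=20120312 04:38:02.140"
)
print(entity_id, props["prop1"])
```

Timestamps like prop3 above contain spaces but no commas, so the plain comma
split still works. Once the pairs are in a map, filters like the WHERE
clauses above become ordinary comparisons on map values.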

On Thu, Mar 22, 2012 at 9:29 AM, Bill Graham <bi...@gmail.com> wrote:


Re: how to best process key-value pairs with Pig

Posted by Bill Graham <bi...@gmail.com>.
What about denormalizing and just representing these as 4-tuples of (id,
type, name, value) in a text file? You could always then group by type if
you need to get back to distinct types.
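
To make the 4-tuple idea concrete, here is a small sketch (plain Python with
invented sample rows, not Pig) of regrouping denormalized
(id, type, name, value) rows into one record per entity, which is roughly
what grouping by id would give you:

```python
from collections import defaultdict

# Hypothetical denormalized rows: (id, type, name, value), one per attribute.
rows = [
    ("e1", "customer", "prop1", "xyz"),
    ("e1", "customer", "prop2", "abc"),
    ("e2", "order",    "prop1", "qqq"),
]

# Collapse N attribute rows per entity into a single dict per entity.
entities = defaultdict(dict)
for entity_id, entity_type, name, value in rows:
    entities[entity_id]["type"] = entity_type
    entities[entity_id][name] = value

print(dict(entities["e1"]))
```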

Are you joining against a larger dataset? I ask just because 10x200 values
is not a lot and can be done without Hadoop.


On Wed, Mar 21, 2012 at 11:49 AM, shan s <my...@gmail.com> wrote:



