Posted to user@pig.apache.org by Christian Decker <de...@gmail.com> on 2010/09/07 17:19:53 UTC

Re: Expand row content

I was thinking about writing my own Cassandra storage class so that I can
load data efficiently from a secondary index. Since I would be writing most
of the code anyway, I thought it might be a good idea to convert the data
straight into the format I'd like to work with later. As it turns out,
though, LoadFunc is not generic and can therefore only return Tuples. Is
that correct?

Regards,
Chris

On Wed, Aug 25, 2010 at 10:08 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> yeah, absolutely. You can have an EvalFunc<Map> that does this. I think it
> has to be <Map>, not <Map<String, Object>> because of how function
> prototypes get mapped, but more or less the same deal.
>
> -D
>
> On Wed, Aug 25, 2010 at 11:53 AM, Christian Decker <
> decker.christian@gmail.com> wrote:
>
> > I'm not sure either, but it's a good point. So basically it would be
> > possible to create a UDF that generates a Map<String, Object> from my
> > input, right?
> > --
> > Christian Decker
> > Software Architect
> > http://blog.snyke.net
> >
> >
> > On Wed, Aug 25, 2010 at 8:11 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> >
> > > Chris,
> > > This sort of pattern is not common because Map<String, Object> is a
> > > built-in data type in Pig; I am not sure why Cassandra doesn't just use
> > > it. That would seem to be the right solution based on what I am reading
> > > in your email.
> > >
> > > -D
> > >
> > > On Wed, Aug 25, 2010 at 10:59 AM, Christian Decker <
> > > decker.christian@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm trying to read some data from CassandraStorage (contrib by
> > > > Cassandra) and then work on it, but the format of the data is just
> > > > incredibly ugly. When just loading it and dumping it I can see that
> > > > the format is something like this:
> > > >
> > > > (key,{(col0,col0value),(col1,col1value),(col2,col2value),(col3,col3value)})
> > > >
> > > >
> > > > which makes my UDFs incredibly ugly:
> > > >
> > > > public Boolean exec(Tuple arg0) throws IOException {
> > > >     // col1 and col2 are assumed to be declared elsewhere in the UDF.
> > > >     DataBag b = (DataBag) arg0.get(0);
> > > >     Iterator<Tuple> i = b.iterator();
> > > >     while (i.hasNext()) {
> > > >         Tuple next = i.next();
> > > >         if ("col1".equals(next.get(0).toString()))
> > > >             col1 = Double.parseDouble(next.get(1).toString());
> > > >         else if ("longitude".equals(next.get(0).toString()))
> > > >             col2 = Double.parseDouble(next.get(1).toString());
> > > >     }
> > > >     ...
> > > > }
> > > >
> > > >
> > > > As you can see, most of this is just iterating over the DataBag and
> > > > mapping the column names to their values before working on the real
> > > > data. Since my guess is that this is quite commonplace and
> > > > time-consuming, I was wondering whether there is a better way to
> > > > prepare the data before passing it to the UDFs, some sort of HashMap
> > > > that extracts column names and values and stores them correctly.
> > > >
> > > > Regards,
> > > > Chris
> > > >
> > > > --
> > > > Christian Decker
> > > > Software Architect
> > > > http://blog.snyke.net
> > > >
> > >
> >
>

Re: Expand row content

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I think I must be confused because it seems like you know the keys
beforehand, so you can create a tuple in the Loader. And if you don't know
them, you can't pull them out anyhow...

Anyway, is the original question how to pull the key-value pairs out of a
map prior to processing them in a UDF?

mapdata = Load 'foo' using MyLoader() as ( hash:map[] );
rows = foreach mapdata generate
  (chararray) hash#'key1' as key1,
  (int) hash#'key2' as key2;

....

Does that help?
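
As a rough Java analogy for what the Pig script above does with
hash#'key1' and the casts, here is a minimal sketch. It assumes the row is
already a Map<String, Object>; the class and method names (MapFieldAccess,
asString, asInt) are hypothetical and not part of Pig. It mirrors the
behaviour that a missing key yields null, and, as far as I know, that a
value which cannot be converted also ends up as null rather than failing
the job.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Pig: plain-Java analogy for
// (chararray) hash#'key1' and (int) hash#'key2'.
class MapFieldAccess {

    // Look up a value by key and render it as a String; null if the key is absent.
    static String asString(Map<String, Object> row, String key) {
        Object value = row.get(key);
        return value == null ? null : value.toString();
    }

    // Look up a value by key and convert it to an Integer.
    // Returns null if the key is absent or the value is not a valid integer.
    static Integer asInt(Map<String, Object> row, String key) {
        Object value = row.get(key);
        if (value == null) {
            return null;
        }
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        try {
            return Integer.valueOf(value.toString().trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> hash = new HashMap<>();
        hash.put("key1", "some text");
        hash.put("key2", "42");
        System.out.println(asString(hash, "key1"));
        System.out.println(asInt(hash, "key2"));
    }
}
```

The point of the analogy is only that, once the loader hands back a map,
each column is one lookup by name instead of a scan over the whole bag.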

-D

On Wed, Sep 8, 2010 at 10:48 AM, Christian Decker <decker.christian@gmail.com> wrote:

> Well, my problem with the tuple is that I do not get key-value pairs that
> can be accessed by key in my Pig scripts. As I understand it, there is
> currently no way to look up a value by its key, because the keys are
> simply aliases for the indices in the tuple and have to be specified by
> the UDF. So basically I cannot just return a hashmap and then reference
> its entries.
>
> What I'm trying to do is to load rows from Cassandra and then work on them,
> but the CassandraStorage provided by Cassandra just creates a Tuple of the
> key and a databag containing key-value pairs as tuples. I'd like to have a
> more MySQL-esque way of addressing the columns loaded from Cassandra :-)
> --
> Christian Decker
> Software Architect
> http://blog.snyke.net

Re: Expand row content

Posted by Christian Decker <de...@gmail.com>.

Well, my problem with the tuple is that I do not get key-value pairs that
can be accessed by key in my Pig scripts. As I understand it, there is
currently no way to look up a value by its key, because the keys are simply
aliases for the indices in the tuple and have to be specified by the UDF.
So basically I cannot just return a hashmap and then reference its entries.

What I'm trying to do is to load rows from Cassandra and then work on them,
but the CassandraStorage provided by Cassandra just creates a Tuple of the
key and a databag containing key-value pairs as tuples. I'd like to have a
more MySQL-esque way of addressing the columns loaded from Cassandra :-)
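
To make the conversion concrete, here is a minimal sketch of turning the
bag of (name, value) pairs into a map keyed by column name. Plain Java
collections stand in for Pig's Tuple and DataBag so that it runs without
Pig on the classpath; the class name ColumnBagToMap and the method toMap
are hypothetical. In a real UDF or loader the same loop would run over the
DataBag instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: plain collections model Pig's DataBag of two-field Tuples.
class ColumnBagToMap {

    // Each inner list models one (columnName, columnValue) tuple from the bag.
    static Map<String, Object> toMap(List<List<Object>> columns) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (List<Object> column : columns) {
            // Skip malformed entries instead of failing the whole row.
            if (column == null || column.size() < 2 || column.get(0) == null) {
                continue;
            }
            row.put(column.get(0).toString(), column.get(1));
        }
        return row;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList("col0", "col0value"));
        bag.add(Arrays.<Object>asList("col1", "1.5"));
        Map<String, Object> row = toMap(bag);
        // Columns are now addressable by name rather than by position.
        System.out.println(row.get("col1"));
    }
}
```

A LinkedHashMap is used here only to keep the columns in the order they
were loaded; a plain HashMap would do if the order does not matter.
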
--
Christian Decker
Software Architect
http://blog.snyke.net


On Tue, Sep 7, 2010 at 7:52 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Yes. A tuple is (kind of) like a row in a database table -- a container for
> fields, which may be of a number of different types. A LoadFunc returns
> rows. You can, however, stuff any objects you like into them by serializing
> them into bytearrays, or by doing things like extending Tuple and
> overriding its methods; see for example the ProtobufTuple in Elephant-Bird:
> http://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/util/ProtobufTuple.java
> The latter should probably be considered a risky thing to do, as the Tuple
> class is still evolving.
>
> Can you be more specific about what you are trying to do?
>
> -D

Re: Expand row content

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Yes. A tuple is (kind of) like a row in a database table -- a container for
fields, which may be of a number of different types. A LoadFunc returns
rows. You can, however, stuff any objects you like into them by serializing
them into bytearrays, or by doing things like extending Tuple and overriding
its methods; see for example the ProtobufTuple in Elephant-Bird:
http://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/util/ProtobufTuple.java
The latter should probably be considered a risky thing to do, as the Tuple
class is still evolving.
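
As an illustration of the serialization route, here is a minimal sketch
that turns a map of column values into a byte array and back using standard
Java serialization. The class name RowBytes and its methods are
hypothetical; in Pig the resulting bytes would presumably be wrapped in a
DataByteArray before being placed in the tuple, which is left out here so
the example runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: round-trips a map of columns through a byte array.
class RowBytes {

    // Serialize the columns; the stream is closed before the buffer is read.
    static byte[] toBytes(HashMap<String, String> columns) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(columns);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialize the bytes back into a map of columns.
    @SuppressWarnings("unchecked")
    static Map<String, String> fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Map<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashMap<String, String> columns = new HashMap<>();
        columns.put("col1", "1.5");
        byte[] bytes = toBytes(columns);
        System.out.println(fromBytes(bytes).equals(columns));
    }
}
```

The trade-off is that the payload is opaque to Pig: a script cannot look
inside it without a UDF that deserializes it again, which is why a map
field is usually the more convenient choice.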

Can you be more specific about what you are trying to do?

-D

On Tue, Sep 7, 2010 at 8:19 AM, Christian Decker <decker.christian@gmail.com> wrote:

> I was thinking about writing my own Cassandra storage class so that I can
> load data efficiently from a secondary index. Since I would be writing most
> of the code anyway, I thought it might be a good idea to convert the data
> straight into the format I'd like to work with later. As it turns out,
> though, LoadFunc is not generic and can therefore only return Tuples. Is
> that correct?
>
> Regards,
> Chris