Posted to user@pig.apache.org by William Oberman <ob...@civicscience.com> on 2012/11/06 21:20:00 UTC

Having troubles with PigStorage

I'm trying to play around with Amazon EMR, and I currently have self-hosted
Cassandra as the source of data.  I was going to try to do: Cassandra -> S3
-> EMR.  I've traced my problems to PigStorage.  At this point I can
recreate my problem "locally" without involving S3 or Amazon.

In my local test environment I have this script:

data = LOAD 'cassandra://XXX/YYY' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});

STORE data INTO 'hdfs://ZZZ/tmp/test' USING PigStorage();


I can verify that the HDFS file looks vaguely correct (\t-separated fields,
newline-separated rows, my data is in the right spots).


Then if I do:

data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS (key:chararray,
columns:bag {column:tuple (name, value)});

keys = FOREACH data GENERATE key;

DUMP keys;


I can see that the data is wrong.  In the dump I sometimes see keys,
sometimes columns, and sometimes a mishmash of keys and columns lumped
together.


As far as I can tell, PigStorage is unable to parse the data it just
persisted.  I've tried Pig 0.8, 0.9, and 0.10 with the same results.


In terms of my data:

key = URI (ASCII)

columns = binary UUID -> JSON (ASCII)


Any ideas?  Next I guess I'll see what kind of debugging Pig offers in its
STORE/LOAD paths.
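For anyone debugging a similar symptom, one quick check (a sketch in Python rather than Pig; the sample rows are made up) is to histogram delimiter counts per line of the stored part file. If every row has the same number of fields, every line should have the same number of tabs:

```python
# Sanity-check sketch: if PigStorage wrote N fields per row, every line
# should contain exactly N-1 tabs. Uneven counts suggest the field data
# itself contains the delimiter. The sample lines below are made up.
from collections import Counter

def tab_histogram(lines):
    """Map 'number of tabs in a line' -> 'how many lines have that count'."""
    return Counter(line.count("\t") for line in lines)

sample = [
    "http://example.com/a\t{(uuid1,json1)}",            # clean 2-field row
    "http://example.com/b\tbad\tkey\t{(uuid2,json2)}",  # tabs inside a field
]
print(tab_histogram(sample))  # more than one distinct count => corrupt rows
```

Running this over the actual part files would show immediately whether the row widths are consistent.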


Thanks!


will

Re: Having troubles with PigStorage

Posted by William Oberman <ob...@civicscience.com>.
Just in case someone hits this thread with the same issue, please vote
for this bug:
https://issues.apache.org/jira/browse/PIG-1271



Re: Having troubles with PigStorage

Posted by William Oberman <ob...@civicscience.com>.
Wow, ok.  That is completely unexpected.  Thanks for the heads up!

In my case, because part of my data is binary (UUIDs from Cassandra), any
possible byte can appear in the data, making PigStorage... unhelpful ;-)

I just tried AvroStorage in piggybank and that is able to store/load my
data correctly.
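For completeness, the underlying principle of a binary-safe round trip is to encode each field so the delimiter can never collide with the data. A stdlib-only Python sketch of that idea (illustrative only; this is not what AvroStorage does internally, since Avro uses a length-prefixed binary encoding rather than base64):

```python
# Illustrative workaround sketch (Python, not Pig): base64-encode each
# field before joining with tabs, so raw bytes like '\t', '\n', or '\x00'
# can never collide with the delimiter. Decoding reverses it exactly.
import base64

def store_row(fields):
    # base64 output is ASCII with no tabs/newlines, so joining is safe
    return "\t".join(base64.b64encode(f).decode("ascii") for f in fields)

def load_row(line):
    return [base64.b64decode(tok) for tok in line.split("\t")]

row = [b"http://example.com/page", b"\x09raw\nbinary\x00uuid-bytes"]
assert load_row(store_row(row)) == row  # binary survives the round trip
```

The cost is larger files and an extra decode step, which is part of why a real container format like Avro is the nicer answer.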



Re: Having troubles with PigStorage

Posted by Cheolsoo Park <ch...@cloudera.com>.
>> This is a dumb question, but PigStorage escapes the delimiter, right?

No, it doesn't.
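To spell out the consequence (a minimal Python model of the behavior, not PigStorage's actual code): store joins fields with '\t' and load splits on '\t', so with no escaping, any tab inside a field value shifts every subsequent field:

```python
# Minimal model of an unescaped-delimiter round trip (illustrative
# Python, not PigStorage source): store joins on '\t', load splits on
# '\t', and nothing escapes tabs that appear inside field values.
def store_row(fields, delim="\t"):
    return delim.join(fields)

def load_row(line, delim="\t"):
    return line.split(delim)

row = ["http://example.com/page", "value\twith-embedded-tab"]  # 2 fields
parsed = load_row(store_row(row))
print(len(row), len(parsed))  # 2 fields in, 3 fields out: fields shifted
```

This matches the symptom upthread: keys and columns landing in each other's positions.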


Re: Having troubles with PigStorage

Posted by William Oberman <ob...@civicscience.com>.
This is a dumb question, but PigStorage escapes the delimiter, right?  I
was assuming I didn't have to pick a delimiter that never appears in the
data, since the delimiter would be escaped by the export process and
unescaped by the import process....



Re: Having troubles with PigStorage

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Will,

>> data = LOAD 'hdfs://ZZZ/tmp/test' USING PigStorage() AS
(key:chararray,columns:bag {column:tuple (name, value)});

Can you please provide some rows of your data from this file
(hdfs://ZZZ/tmp/test) so that we can reproduce your problem? 1~2 rows
would be sufficient.

Thanks,
Cheolsoo
