Posted to user@pig.apache.org by Lauren Blau <la...@digitalreasoning.com> on 2012/08/15 13:44:25 UTC

newbie just not getting structure

I'm having problems with understanding storage structures. Here's what I
did:

On the cluster I loaded some data and created a relation with one row.
I output the row using: store relation into '/file' using PigStorage('|');
Then I copied it to my local workspace: copyToLocal /file ./file
Then I tarred up the local file and scp'd it to my laptop.

on my laptop I untarred the file into data/file
then I ran these pig commands:

b = load 'data/file' using PigStorage('|') as (a:map[]); -- because I'm expecting a map
dump b;

return is successful but result is ().

then I ran
c = foreach b generate *;
dump c;

return is successful but result is ().

then I tried

d = load 'data/file' using PigStorage('|');
dump d;

The return is:
([id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}])

Since that looks like a map, I'm not sure why dump b didn't return values.
So then I tried
e = foreach d generate $0#'id';
dump e;

and the return was ();

Does anyone see where I'm missing the point? And how do I grab those map
values?

Thanks
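
A rough way to see what the loader has to work with: the sketch below is plain Python, not Pig, and only assumes that a delimited text loader splits each record on the delimiter character without looking at brackets. The helper name split_fields is made up for illustration. It splits the dumped line once on '|' and once on ','.

```python
# The record exactly as it appears in the output of "dump d" (without the outer tuple parentheses).
line = ("[id#ID1,documentDate#1344461328851,source#93931,indexed#false,"
        "lastModifiedDate#1344461328851,contexts#{([id#CID1])}]")


def split_fields(record, delimiter):
    """Hypothetical helper: split a text record on a single-character delimiter."""
    return record.split(delimiter)


# No '|' occurs in the record, so the whole bracketed string is one field.
fields_pipe = split_fields(line, "|")
print(len(fields_pipe))

# Splitting on ',' instead cuts the bracketed string into pieces.
fields_comma = split_fields(line, ",")
print(len(fields_comma))
```

Under this assumption, with '|' the whole bracketed string arrives as a single field, so whether it shows up as a map depends entirely on how that one field is converted.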

Re: newbie just not getting structure

Posted by Lauren Blau <la...@digitalreasoning.com>.
Still not getting it. A similar problem is occurring:
I have a file which I believe contains structures like,
("string1","string2",{[]})
and if I load it as (messageId:chararray, documentName:chararray, annot:map[])
I can dump it, and I can define:
foo = foreach row generate messageId as messageId:chararray,documentName as
documentName:chararray,annot#'prefix' as apre:chararray, annot#'label' as
alabel:chararray ..), and can dump foo and see my results as expected

if I try
x = filter foo by apre == 'VALUE';
I get 0 rows back and I see a warning about
FIELD_DISCARDED_CONVERSION_FAILED

but if I store foo into a file using
store foo into '/filefoo';
and then define
foo2 = load '/filefoo' as
(messageId:chararray,documentName:chararray,apre:chararray,alabel:chararray
..)
then
y = filter foo2 by apre == 'VALUE';
I get back the rows I expect.

Would someone please explain what the difference between the two is? Why should
storing and re-reading the data make a difference? What am I missing?
Thanks.


Re: newbie just not getting structure

Posted by Lauren Blau <la...@digitalreasoning.com>.
I don't know what the data file on disk looks like, as it is compressed or
encoded. It should be in whatever format PigStorage('|') would store a map.

The original relation was data that had been loaded into a map, and
could be accessed as a map, so something like:

original = load 'origFile' using customeLoader('params') as (a:map[]);
At this point I can work with original as a map correctly, accessing fields
using a#'fieldname'.
Then I did a filter down to one row:
small = filter original by a#'id' == 'rowofinterest';
then I stored it
store small into '/outputfilename' using PigStorage('|');

so whatever format PigStorage('|') put in the output file is what it is. I
haven't manipulated it, just copied down to a different machine.

lauren


On Wed, Aug 15, 2012 at 7:06 PM, Cheolsoo Park <ch...@cloudera.com> wrote:

> Hi,
>
> What's the content of data/file like? Given your description, I guess that
> it looks as follows:
>
>
> [id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}]
>
> But this is not map literal format. If you change it to:
>
> [id#ID1]
> [documentDate#1344461328851]
> [source#93931]
> [indexed#false]
> [lastModifiedDate#1344461328851]
> [contexts#{([id#CID1])}]
>
> then you can load it as map:
>
> >> a = load 'data/file'  using PigStorage(',') as (m:map[]);
> >> dump a;
>
> ([id#ID1])
> ([documentDate#1344461328851])
> ([source#93931])
> ([indexed#false])
> ([lastModifiedDate#1344461328851])
> ([contexts#{([id#CID1])}])
>
> Furthermore, you can do:
>
> >> b = foreach a generate $0#'id';
> >> dump b;
>
> (ID1)
> ()
> ()
> ()
> ()
> ()
>
> This is what you expect, no?
>
> Thanks,
> Cheolsoo
>
>

Re: newbie just not getting structure

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi,

What's the content of data/file like? Given your description, I guess that
it looks as follows:

[id#ID1,documentDate#1344461328851,source#93931,indexed#false,lastModifiedDate#1344461328851,contexts#{([id#CID1])}]

But this is not in map literal format. If you change it to:

[id#ID1]
[documentDate#1344461328851]
[source#93931]
[indexed#false]
[lastModifiedDate#1344461328851]
[contexts#{([id#CID1])}]

then you can load it as a map:

>> a = load 'data/file'  using PigStorage(',') as (m:map[]);
>> dump a;

([id#ID1])
([documentDate#1344461328851])
([source#93931])
([indexed#false])
([lastModifiedDate#1344461328851])
([contexts#{([id#CID1])}])

Furthermore, you can do:

>> b = foreach a generate $0#'id';
>> dump b;

(ID1)
()
()
()
()
()

This is what you expect, no?

Thanks,
Cheolsoo
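
The lookup above can be mimicked outside Pig. The following is a toy Python sketch, not Pig's own parser: parse_map is a hypothetical helper that assumes '#' separates a key from its value, ',' separates entries at the top level only, and values are kept as raw strings. It reads the six one-key lines and looks up 'id' in each.

```python
def parse_map(text):
    """Toy parser for '[key#value,key#value]' text; returns a dict or None."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return None
    body = text[1:-1]
    if body == "":
        return {}
    # Split on commas that are not inside a nested (), {} or [].
    entries, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "[{(":
            depth += 1
        elif ch in "]})":
            depth -= 1
        elif ch == "," and depth == 0:
            entries.append(body[start:i])
            start = i + 1
    entries.append(body[start:])
    result = {}
    for entry in entries:
        key, sep, value = entry.partition("#")
        if not sep:
            return None  # an entry without '#' is not a key#value pair
        result[key] = value
    return result


lines = [
    "[id#ID1]",
    "[documentDate#1344461328851]",
    "[source#93931]",
    "[indexed#false]",
    "[lastModifiedDate#1344461328851]",
    "[contexts#{([id#CID1])}]",
]

# Look up 'id' in each one-key map; a missing key gives None.
ids = [parse_map(line).get("id") for line in lines]
print(ids)  # ['ID1', None, None, None, None, None]
```

Only the first line has an 'id' key at the top level, which matches the one (ID1) followed by five empty tuples in the dump. The 'id' inside the contexts bag is nested and is not found by a top-level lookup.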

