Posted to user@pig.apache.org by Klüber, Ralf <Ra...@p3-group.com> on 2014/08/07 20:22:10 UTC

Json Loader - Array of objects - Loading results in empty data set

Hello,

I am new to this list. I have been trying to solve this problem for the last 48 hours, but I am stuck. I hope someone here can point me in the right direction.

I am having problems with the Pig JsonLoader and am wondering whether I am doing something wrong or have run into a different problem.

The first half of this post shows that I know at least something about what I am talking about and that I did my homework. During my research I found a lot about elephant-bird, but there seems to be a conflict with Cloudera, so I am stuck there as well. If you trust me already, you can jump directly to the second half of my post ;-).

The desired solution should work both on Cloudera and on Amazon EMR.



To prove something works.

--------------------------



I have this data file:

```
$ cat a.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}
$ ./jq '.' a.json
{
  "DataASet": {
    "A1": 1,
    "A2": 4,
    "DataBSets": [
      {
        "B1": "1",
        "B2": "1"
      },
      {
        "B1": "2",
        "B2": "2"
      }
    ]
  }
}
$
```



I am using this Pig Script to load it.



``` Pig
a = load 'a.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         (
           B1:chararray,
           B2:chararray
         )
       }
     )
');
```



In grunt everything seems ok.



```
grunt> describe a;
a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}
grunt> dump a;
((1,4,{(1,1),(2,2)}))
grunt>
```



So far so good.



Real Problem

------------



My real data (gigabytes of it) looks a little different: the array elements are not the B1/B2 objects themselves, but objects that each wrap one under a DataBSet key.



```
$ ./jq '.' b.json
{
  "DataASet": {
    "A1": 1,
    "A2": 4,
    "DataBSets": [
      {
        "DataBSet": {
          "B1": "1",
          "B2": "1"
        }
      },
      {
        "DataBSet": {
          "B1": "2",
          "B2": "2"
        }
      }
    ]
  }
}
$ cat b.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}
$
```
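
The difference between the two files can be made concrete with a minimal sketch, assuming Python with only the standard library. It is an illustration added for clarity, not part of the Pig job, and the helper name `element_keys` is made up. It lists the keys of each element of the DataBSets array in both records.

```python
import json

# The two one-line records shown above (contents of a.json and b.json).
a_line = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}'
b_line = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}'


def element_keys(line):
    """Return the sorted key names of each object in the DataBSets array."""
    doc = json.loads(line)
    return [sorted(elem) for elem in doc["DataASet"]["DataBSets"]]


a_keys = element_keys(a_line)
b_keys = element_keys(b_line)
print(a_keys)  # [['B1', 'B2'], ['B1', 'B2']]
print(b_keys)  # [['DataBSet'], ['DataBSet']]
```

In a.json every array element carries B1 and B2 directly, while in b.json every element has a single DataBSet key, so the schema needs one more level of nesting per element.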



I am trying to load this JSON with the following schema:



``` Pig
b = load 'b.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         DataBSet: (
           B1:chararray,
           B2:chararray
         )
       }
     )
');
```



Again it looks good so far in grunt.



```
grunt> describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}
```

I expect something like this when dumping b:



```
((1,4,{((1,1)),((2,2))}))
```



But I get this:



```
grunt> dump b;
()
grunt>
```



Obviously I am doing something wrong. An empty result suggests that the schema does not match the input line.



Any hints? Thanks in advance.



Kind regards.

Ralf

Re: Json Loader - Array of objects - Loading results in empty data set

Posted by Pradeep Gollakota <pr...@gmail.com>.
I haven't worked with JsonLoader much, so I'm not sure what the problem is.
But your schema looks correct for your JSON structure now.

DataBSets is an Array (or Bag) of Objects (or Tuples). Each Object (or
Tuple) inside the Array has one key, which maps to an Object (or Tuple) with
two keys. This is exactly what you would want the structure to look like in
Pig.

```
grunt> describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {tuple_0: (DataBSet: (B1: chararray,B2: chararray))})}
grunt> dump b;
()
grunt>
```
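
The mapping described above (JSON array to bag, JSON object to tuple) can be illustrated with a minimal sketch, assuming Python with only the standard library. It does not use Pig or JsonLoader at all; the function name `to_pig` is made up, and it only mimics the notation that `dump` prints.

```python
import json


def to_pig(value):
    """Render a parsed JSON value in the notation Pig's dump uses."""
    if isinstance(value, dict):
        # JSON object -> tuple holding the object's values in key order
        return "(" + ",".join(to_pig(v) for v in value.values()) + ")"
    if isinstance(value, list):
        # JSON array -> bag of its elements
        return "{" + ",".join(to_pig(v) for v in value) + "}"
    # Scalars (numbers and strings here) are printed bare
    return str(value)


# The record from b.json in the original post.
b_line = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}'
rendered = to_pig(json.loads(b_line))
print(rendered)  # ((1,4,{((1,1)),((2,2))}))
```

The doubled parentheses inside the bag come from the extra DataBSet wrapper: one tuple for the array element and one for the object it contains.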

I know that lots of people have had problems with JsonLoader in the
past. Off-hand I can recall several emails on this mailing list over the
past year complaining about the loader. Most of the recommendations, as far
as I remember, have been to use the elephant-bird version of the loader.

I'm not sure what version conflict you're seeing with CDH +
elephant-bird, but I'd recommend compiling elephant-bird against the
versions of Hadoop and Pig that you're using and deploying it to your Maven
repo. I do this myself so that I know all the code is compiled against the
correct versions that we're running in house.

I'm going to look into this problem a little bit more and see if I can get
it to work without elephant-bird.
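
One possible workaround that needs neither elephant-bird nor a schema change is sketched below. It is an assumption, not a tested fix for the JsonLoader behaviour: each line is preprocessed so that b.json takes the shape of a.json, which the built-in loader read correctly in the original post. The sketch assumes Python with only the standard library, and the helper name `unwrap_line` is made up. For gigabytes of input the same logic would have to run as a streaming or distributed step rather than a single script.

```python
import json


def unwrap_line(line):
    """Replace each {"DataBSet": {...}} array element with its inner object."""
    doc = json.loads(line)
    data_a = doc["DataASet"]
    data_a["DataBSets"] = [elem["DataBSet"] for elem in data_a["DataBSets"]]
    # Compact separators keep the one-record-per-line layout of a.json.
    return json.dumps(doc, separators=(",", ":"))


# The record from b.json in the original post.
b_line = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}'
flattened = unwrap_line(b_line)
print(flattened)
```

For this record the output is identical to the line in a.json, so the schema that worked for `a` should apply to the flattened file as well.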



Re: Json Loader - Array of objects - Loading results in empty data set

Posted by Klüber, Ralf <Ra...@p3-group.com>.
Hello,

I much appreciate you taking the time to answer.

> should probably look like
> 
>  {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
> chararray))})}

How can I achieve this? I tried:
```
b = load 'b.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
        (
        (DataBSet: (
           B1:chararray,
           B2:chararray
         )
        ))
       }
     )
 ');
```

This gives the following schema, which does not look right.
The dump still comes back empty:

```
grunt> describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {tuple_0: (DataBSet: (B1: chararray,B2: chararray))})}
grunt> dump b;
()
grunt>
```

Kind regards.
Ralf 


Re: Json Loader - Array of objects - Loading results in empty data set

Posted by Pradeep Gollakota <pr...@gmail.com>.
I think there's a problem with your schema.

 {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}

should probably look like

 {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2: chararray))})}

