Posted to user@hadoop.apache.org by Tecno Brain <ce...@gmail.com> on 2013/06/13 02:57:21 UTC

Aggregating data nested into JSON documents

Hello,
   I'm new to Hadoop.
   I have a large quantity of JSON documents with a structure similar to
what is shown below.

   {
     g : "some-group-identifier",
     sg: "some-subgroup-identifier",
     j      : "some-job-identifier",
     page     : 23,
     ... // other fields omitted
     important-data : [
         {
           f1  : "abc",
           f2  : "a",
           f3  : "/"
           ...
         },
         ...
         {
           f1 : "xyz",
           f2  : "q",
           f3  : "/",
           ...
         },
     ],
    ... // other fields omitted
     other-important-data : [
        {
           x1  : "ford",
           x2  : "green",
            x3  : 35,
           map : {
               "free-field" : "value",
               "other-free-field" : value2"
              }
         },
         ...
         {
           x1 : "vw",
           x2  : "red",
           x3  : 54,
           ...
         },
     ]
   }


Each file contains a single JSON document (gzip compressed; roughly 200KB
of uncompressed, pretty-printed JSON text per document).

I am interested in analyzing only the "important-data" array and the
"other-important-data" array.
My source data would be easier to analyze if it looked like a couple of
tables with a fixed set of columns. Only the "map" column would be
complex; all the others would be primitives.

( g, sg, j, page, f1, f2, f3 )

( g, sg, j, page, x1, x2, x3, map )
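
For the sample document above, the first table would then contain rows like:

( "some-group-identifier", "some-subgroup-identifier", "some-job-identifier", 23, "abc", "a", "/" )
( "some-group-identifier", "some-subgroup-identifier", "some-job-identifier", 23, "xyz", "q", "/" )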

So, for each JSON document, I would like to "create" several rows, but I
would like to avoid the intermediate step of persisting (and duplicating)
the "flattened" data.

To avoid persisting the flattened data, I thought I had to write my own
MapReduce job in Java, but I discovered that others have had the same
problem of using JSON as the source, and that there are somewhat
"standard" solutions.

From reading about the SerDe approach for Hive, I get the impression that
each JSON document is transformed into a single "row" of the table, with
some columns being an array, a map, or other nested structures.
a) Is there a way to break each JSON document into several "rows" for a
Hive external table?
b) It seems there are too many JSON SerDe libraries! Is any of them
considered the de-facto standard?
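
To make question (a) concrete, here is the kind of Hive sketch I am
imagining (the SerDe class and the key-mapping property are just examples
I have seen mentioned, not something I have tested, and the paths and
column names are placeholders; the hyphenated JSON keys would presumably
need to be mapped to legal column names somehow):

-- Sketch only: declare the nested array as a typed column, then use
-- LATERAL VIEW explode() to emit one row per array element.
CREATE EXTERNAL TABLE raw_docs (
  g              string,
  sg             string,
  j              string,
  page           int,
  important_data array<struct<f1:string, f2:string, f3:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.important_data" = "important-data")
LOCATION '/path/to/json/docs';

SELECT g, sg, j, page, d.f1, d.f2, d.f3
FROM raw_docs
LATERAL VIEW explode(important_data) t AS d;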

The Pig approach using Elephant Bird also seems promising. Does anybody
have pointers to more user documentation for this project? Or is browsing
through the examples on GitHub my only source?

Thanks

Re: Aggregating data nested into JSON documents

Posted by Tecno Brain <ce...@gmail.com>.
Never mind, I got the solution!

uberflat = FOREACH flat GENERATE g, sg,
              FLATTEN(important-data#'f1') as f1,
              FLATTEN(important-data#'f2') as f2;
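
Putting the pieces from this thread together, a minimal end-to-end sketch
(untested as written; the positional reference $2 to the flattened map is
an assumption, since the flattened field has no declared alias):

doc  = LOAD '/example.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);
-- FLATTEN emits one record per element of the nested array; each element
-- comes back as a map, e.g. [f1#abc, f2#a, f3#/]
flat = FOREACH doc GENERATE
       (chararray)json#'g'  AS g,
       (chararray)json#'sg' AS sg,
       FLATTEN(json#'important-data');
-- the flattened map is the third field, so reference it by position
rows = FOREACH flat GENERATE
       g, sg,
       (chararray)$2#'f1' AS f1,
       (chararray)$2#'f2' AS f2,
       (chararray)$2#'f3' AS f3;
DUMP rows;

The same pattern should apply to the "other-important-data" array for the
second table shape.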

-Jorge


On Thu, Jun 20, 2013 at 11:54 AM, Tecno Brain
<ce...@gmail.com> wrote:

> OK, I'll go back to my original question (although this time I know what
> tools I'm using).
>
> I am using Pig + ElephantBird.
>
> I have JSON documents with the following structure:
> {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/"
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ...
>          },
>      ]
>     ... // other fields omitted
> }
>
> I want Pig to GENERATE a tuple for each element of the "important-data"
> array attribute. For the example above, I would like to generate:
>
> ( "some-group-identifier" , "some-subgroup-identifier", 23, "abc", "a",
> "/" )
> ( "some-group-identifier" , "some-subgroup-identifier", 23, "xyz", "q",
> "/" )
>
> This is what I have tried:
>
> doc = LOAD '/example.json' USING
>      com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as
> (json:map[]);
> flat = FOREACH doc  GENERATE  (chararray)json#'gr' as g, (long)json#'sg'
> as sg,  FLATTEN( json#'important-data') ;
> DUMP flat;
>
> but that produces:
>
> ( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#abc,
> f2#a, f3#/ ] )
> ( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#xyz,
> f2#q, f3#/ ] )
>
> Close, but not exactly what I want.
>
> Do I need to use ProtoBuf?
>
> -Jorge
>
>
> On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <cerebrotecnologico@gmail.com
> > wrote:
>
>> Ok, I found that elephant-bird's JsonLoader cannot handle JSON documents
>> that are pretty-printed (spanning multiple lines). The entire JSON
>> document has to be on a single line.
>>
>> After I reformatted some of the source files, I am now getting the
>> expected output.
>>
>>
>>
>>
>> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <
>> cerebrotecnologico@gmail.com> wrote:
>>
>>> I also tried:
>>>
>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>  com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>>>  flat = FOREACH doc  GENERATE  (chararray)json#'a' AS first,
>>> (long)json#'b' AS second ;
>>> DUMP flat;
>>>
>>> but I got no output either.
>>>
>>>      Input(s):
>>>      Successfully read 0 records (35863 bytes) from:
>>> "/json-pcr/pcr-000001.json"
>>>
>>>      Output(s):
>>>      Successfully stored 0 records in:
>>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>>>
>>>
>>>
>>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <
>>> cerebrotecnologico@gmail.com> wrote:
>>>
>>>> I got Pig and Hive working on a single node and I am able to run some
>>>> scripts/queries over regular text files (access log files), with a
>>>> record per line.
>>>>
>>>> Now, I want to process some JSON files.
>>>>
>>>> As mentioned before, it seems that ElephantBird would be a good
>>>> solution to read JSON files.
>>>>
>>>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>>>> document. The documents are NOT on a single line, but rather contain
>>>> pretty-printed JSON spanning multiple lines.
>>>>
>>>> I'm trying something simple, extracting two (primitive) attributes at
>>>> the top of the document:
>>>> {
>>>>    a : "some value",
>>>>    ...
>>>>    b : 133,
>>>>    ...
>>>> }
>>>>
>>>> So, let's start with a LOAD of a single file (single JSON document):
>>>>
>>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>>> doc = LOAD '/json-pcr/pcr-000001.json' using
>>>>  com.twitter.elephantbird.pig.load.JsonLoader();
>>>> flat  = FOREACH doc GENERATE (chararray)$0#'a' AS  first, (long)$0#'b'
>>>> AS second ;
>>>> DUMP flat;
>>>>
>>>> Apparently the job runs without problem, but I get no output. The
>>>> output I get includes this message:
>>>>
>>>>    Input(s):
>>>>    Successfully read 0 records (35863 bytes) from:
>>>> "/json-pcr/pcr-000001.json"
>>>>
>>>> I was expecting to get
>>>>
>>>> ( "some value", 133 )
>>>>
>>>> Any idea on what I am doing wrong?
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> I think you have a misconception of HBase.
>>>>>
>>>>> You don't need to actually have mutable data for it to be effective.
>>>>> The key is that you need to have access to specific records and work
>>>>> with a very small subset of the data, not the complete data set.
>>>>>
>>>>>
>>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <
>>>>> cerebrotecnologico@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Yes, I have also thought about HBase or Cassandra, but my data is
>>>>> pretty much a snapshot; it does not require updates. Most of my
>>>>> aggregations will also need to be computed only once and won't change
>>>>> over time, with the exception of some aggregations that are based on
>>>>> the last N days of data. Should I still consider HBase? I think it
>>>>> will probably be good for the aggregated data.
>>>>>
>>>>> I have no idea what sequence files are, but I will take a look. My
>>>>> raw data is stored in the cloud, not in my Hadoop cluster.
>>>>>
>>>>> I'll keep looking at Pig with ElephantBird.
>>>>> Thanks,
>>>>>
>>>>> -Jorge
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <
>>>>> michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> Hi..
>>>>>>
>>>>>> Have you thought about HBase?
>>>>>>
>>>>>> I would suggest that if you're using Hive or Pig, you look at taking
>>>>>> these files and putting the JSON records into a sequence file.
>>>>>> Or a set of sequence files.... (Then look at HBase to help index
>>>>>> them...) 200KB is small.
>>>>>>
>>>>>> That would be the same for either Pig/Hive.
>>>>>>
>>>>>> In terms of SerDes, I've worked w/ Pig and ElephantBird; it's pretty
>>>>>> nice. And yes, you get each record as a row; however, you can always
>>>>>> flatten them as needed.
>>>>>>
>>>>>> Hive?
>>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>>> Edward Capriolo could give you a better answer.
>>>>>> Going from memory, I don't know that there is a good SerDe that would
>>>>>> write JSON, just read it. (Hive)
>>>>>>
>>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>>> dated and biased.
>>>>>>
>>>>>> I think you're on the right track or at least train of thought.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>
>>>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <
>>>>>> cerebrotecnologico@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>    I'm new to Hadoop.
>>>>>>    I have a large quantity of JSON documents with a structure similar
>>>>>> to what is shown below.
>>>>>>
>>>>>>    {
>>>>>>      g : "some-group-identifier",
>>>>>>      sg: "some-subgroup-identifier",
>>>>>>      j      : "some-job-identifier",
>>>>>>      page     : 23,
>>>>>>      ... // other fields omitted
>>>>>>      important-data : [
>>>>>>          {
>>>>>>            f1  : "abc",
>>>>>>            f2  : "a",
>>>>>>            f3  : "/"
>>>>>>            ...
>>>>>>          },
>>>>>>          ...
>>>>>>          {
>>>>>>            f1 : "xyz",
>>>>>>            f2  : "q",
>>>>>>            f3  : "/",
>>>>>>            ...
>>>>>>          },
>>>>>>      ],
>>>>>>     ... // other fields omitted
>>>>>>      other-important-data : [
>>>>>>         {
>>>>>>            x1  : "ford",
>>>>>>            x2  : "green",
>>>>>>            x3  : 35,
>>>>>>            map : {
>>>>>>                "free-field" : "value",
>>>>>>                "other-free-field" : "value2"
>>>>>>               }
>>>>>>          },
>>>>>>          ...
>>>>>>          {
>>>>>>            x1 : "vw",
>>>>>>            x2  : "red",
>>>>>>            x3  : 54,
>>>>>>            ...
>>>>>>          },
>>>>>>      ]
>>>>>>    }
>>>>>>
>>>>>>
>>>>>> Each file contains a single JSON document (gzip compressed; roughly
>>>>>> 200KB of uncompressed, pretty-printed JSON text per document).
>>>>>>
>>>>>> I am interested in analyzing only the  "important-data" array and the
>>>>>> "other-important-data" array.
>>>>>> My source data would ideally be easier to analyze if it looked like a
>>>>>> couple of tables with a fixed set of columns. Only the column "map" would
>>>>>> be a complex column, all others would be primitives.
>>>>>>
>>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>>
>>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>>
>>>>>> So, for each JSON document, I would like to "create" several rows,
>>>>>> but I would like to avoid the intermediate step of persisting -and
>>>>>> duplicating- the "flattened" data.
>>>>>>
>>>>>> In order to avoid persisting the data flattened, I thought I had to
>>>>>> write my own map-reduce in Java code, but discovered that others have had
>>>>>> the same problem of using JSON as the source and there are somewhat
>>>>>> "standard" solutions.
>>>>>>
>>>>>> By reading about the SerDe approach for Hive I get the impression
>>>>>> that each JSON document is transformed into a single "row" of the
>>>>>> table, with some columns being an array, a map, or other nested
>>>>>> structures.
>>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>>> a Hive external table?
>>>>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>>>>> considered the de-facto standard?
>>>>>>
>>>>>> The Pig approach using Elephant Bird also seems promising. Does
>>>>>> anybody have pointers to more user documentation for this project?
>>>>>> Or is browsing through the examples on GitHub my only source?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Aggregating data nested into JSON documents

Posted by Tecno Brain <ce...@gmail.com>.
OK, I'll go back to my original question (although this time I know what
tools I'm using).

I am using Pig + ElephantBird.

I have JSON documents with the following structure:
{
     g : "some-group-identifier",
     sg: "some-subgroup-identifier",
     j      : "some-job-identifier",
     page     : 23,
     ... // other fields omitted
     important-data : [
         {
           f1  : "abc",
           f2  : "a",
           f3  : "/"
           ...
         },
         ...
         {
           f1 : "xyz",
           f2  : "q",
           f3  : "/",
           ...
         },
     ]
    ... // other fields omitted
}

I want Pig to GENERATE a tuple for each element of the "important-data"
array attribute. For the example above, I would like to generate:

( "some-group-identifier" , "some-subgroup-identifier", 23, "abc", "a", "/"
)
( "some-group-identifier" , "some-subgroup-identifier", 23, "xyz", "q", "/"
)

This is what I have tried:

doc = LOAD '/example.json' USING
     com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as
(json:map[]);
flat = FOREACH doc  GENERATE  (chararray)json#'gr' as g, (long)json#'sg' as
sg,  FLATTEN( json#'important-data') ;
DUMP flat;

but that produces:

( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#abc, f2#a,
f3#/ ] )
( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#xyz, f2#q,
f3#/ ] )

Close, but not exactly what I want.

Do I need to use ProtoBuf?

-Jorge


On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain
<ce...@gmail.com> wrote:

> Ok, I found that elephant-bird's JsonLoader cannot handle JSON documents
> that are pretty-printed (spanning multiple lines). The entire JSON
> document has to be on a single line.
>
> After I reformatted some of the source files, I am now getting the
> expected output.
>
>
>
>
> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <cerebrotecnologico@gmail.com
> > wrote:
>
>> I also tried:
>>
>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>  com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>>  flat = FOREACH doc  GENERATE  (chararray)json#'a' AS first,
>> (long)json#'b' AS second ;
>> DUMP flat;
>>
>> but I got no output either.
>>
>>      Input(s):
>>      Successfully read 0 records (35863 bytes) from:
>> "/json-pcr/pcr-000001.json"
>>
>>      Output(s):
>>      Successfully stored 0 records in:
>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>>
>>
>>
>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <
>> cerebrotecnologico@gmail.com> wrote:
>>
>>> I got Pig and Hive working on a single node and I am able to run some
>>> scripts/queries over regular text files (access log files), with a
>>> record per line.
>>>
>>> Now, I want to process some JSON files.
>>>
>>> As mentioned before, it seems that ElephantBird would be a good
>>> solution to read JSON files.
>>>
>>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>>> document. The documents are NOT on a single line, but rather contain
>>> pretty-printed JSON spanning multiple lines.
>>>
>>> I'm trying something simple, extracting two (primitive) attributes at
>>> the top of the document:
>>> {
>>>    a : "some value",
>>>    ...
>>>    b : 133,
>>>    ...
>>> }
>>>
>>> So, let's start with a LOAD of a single file (single JSON document):
>>>
>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>> doc = LOAD '/json-pcr/pcr-000001.json' using
>>>  com.twitter.elephantbird.pig.load.JsonLoader();
>>> flat  = FOREACH doc GENERATE (chararray)$0#'a' AS  first, (long)$0#'b'
>>> AS second ;
>>> DUMP flat;
>>>
>>> Apparently the job runs without problem, but I get no output. The output
>>> I get includes this message:
>>>
>>>    Input(s):
>>>    Successfully read 0 records (35863 bytes) from:
>>> "/json-pcr/pcr-000001.json"
>>>
>>> I was expecting to get
>>>
>>> ( "some value", 133 )
>>>
>>> Any idea on what I am doing wrong?
>>>
>>>
>>>
>>>
>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> I think you have a misconception of HBase.
>>>>
>>>> You don't need to actually have mutable data for it to be effective.
>>>> The key is that you need to have access to specific records and work
>>>> with a very small subset of the data, not the complete data set.
>>>>
>>>>
>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <ce...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Yes, I have also thought about HBase or Cassandra, but my data is pretty
>>>> much a snapshot; it does not require updates. Most of my aggregations
>>>> will also need to be computed only once and won't change over time, with
>>>> the exception of some aggregations that are based on the last N days of
>>>> data. Should I still consider HBase? I think it will probably be good
>>>> for the aggregated data.
>>>>
>>>> I have no idea what sequence files are, but I will take a look. My raw
>>>> data is stored in the cloud, not in my Hadoop cluster.
>>>>
>>>> I'll keep looking at Pig with ElephantBird.
>>>> Thanks,
>>>>
>>>> -Jorge
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> Hi..
>>>>>
>>>>> Have you thought about HBase?
>>>>>
>>>>> I would suggest that if you're using Hive or Pig, you look at taking
>>>>> these files and putting the JSON records into a sequence file.
>>>>> Or a set of sequence files.... (Then look at HBase to help index
>>>>> them...) 200KB is small.
>>>>>
>>>>> That would be the same for either Pig/Hive.
>>>>>
>>>>> In terms of SerDes, I've worked w/ Pig and ElephantBird; it's pretty
>>>>> nice. And yes, you get each record as a row; however, you can always
>>>>> flatten them as needed.
>>>>>
>>>>> Hive?
>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>> Edward Capriolo could give you a better answer.
>>>>> Going from memory, I don't know that there is a good SerDe that would
>>>>> write JSON, just read it. (Hive)
>>>>>
>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>> dated and biased.
>>>>>
>>>>> I think you're on the right track or at least train of thought.
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>>
>>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hello,
>>>>>    I'm new to Hadoop.
>>>>>    I have a large quantity of JSON documents with a structure similar
>>>>> to what is shown below.
>>>>>
>>>>>    {
>>>>>      g : "some-group-identifier",
>>>>>      sg: "some-subgroup-identifier",
>>>>>      j      : "some-job-identifier",
>>>>>      page     : 23,
>>>>>      ... // other fields omitted
>>>>>      important-data : [
>>>>>          {
>>>>>            f1  : "abc",
>>>>>            f2  : "a",
>>>>>            f3  : "/"
>>>>>            ...
>>>>>          },
>>>>>          ...
>>>>>          {
>>>>>            f1 : "xyz",
>>>>>            f2  : "q",
>>>>>            f3  : "/",
>>>>>            ...
>>>>>          },
>>>>>      ],
>>>>>     ... // other fields omitted
>>>>>      other-important-data : [
>>>>>         {
>>>>>            x1  : "ford",
>>>>>            x2  : "green",
>>>>>            x3  : 35,
>>>>>            map : {
>>>>>                "free-field" : "value",
>>>>>                "other-free-field" : "value2"
>>>>>               }
>>>>>          },
>>>>>          ...
>>>>>          {
>>>>>            x1 : "vw",
>>>>>            x2  : "red",
>>>>>            x3  : 54,
>>>>>            ...
>>>>>          },
>>>>>      ]
>>>>>    }
>>>>>
>>>>>
>>>>> Each file contains a single JSON document (gzip compressed; roughly
>>>>> 200KB of uncompressed, pretty-printed JSON text per document).
>>>>>
>>>>> I am interested in analyzing only the  "important-data" array and the
>>>>> "other-important-data" array.
>>>>> My source data would ideally be easier to analyze if it looked like a
>>>>> couple of tables with a fixed set of columns. Only the column "map" would
>>>>> be a complex column, all others would be primitives.
>>>>>
>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>
>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>
>>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>>> would like to avoid the intermediate step of persisting -and duplicating-
>>>>> the "flattened" data.
>>>>>
>>>>> In order to avoid persisting the data flattened, I thought I had to
>>>>> write my own map-reduce in Java code, but discovered that others have had
>>>>> the same problem of using JSON as the source and there are somewhat
>>>>> "standard" solutions.
>>>>>
>>>>> By reading about the SerDe approach for Hive I get the impression
>>>>> that each JSON document is transformed into a single "row" of the
>>>>> table, with some columns being an array, a map, or other nested
>>>>> structures.
>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>> a Hive external table?
>>>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>>>> considered the de-facto standard?
>>>>>
>>>>> The Pig approach using Elephant Bird also seems promising. Does
>>>>> anybody have pointers to more user documentation for this project?
>>>>> Or is browsing through the examples on GitHub my only source?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

>>>>> couple of tables with a fixed set of columns. Only the column "map" would
>>>>> be a complex column, all others would be primitives.
>>>>>
>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>
>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>
>>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>>> would like to avoid the intermediate step of persisting -and duplicating-
>>>>> the "flattened" data.
>>>>>
>>>>> In order to avoid persisting the data flattened, I thought I had to
>>>>> write my own map-reduce in Java code, but discovered that others have had
>>>>> the same problem of using JSON as the source and there are somewhat
>>>>> "standard" solutions.
>>>>>
>>>>> By reading about the SerDe approach for Hive I get the impression
>>>>> that each JSON document is transformed into a single "row" of the table
>>>>> with some columns being an array, a map of other nested structures.
>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>> a Hive external table?
>>>>> b) It seems there are too many JSON SerDe libraries! Is there any of
>>>>> them considered the de-facto standard?
>>>>>
>>>>> The Pig approach seems also promising using Elephant Bird Do anybody
>>>>> has pointers to more user documentation on this project? Or is browsing
>>>>> through the examples in GitHub my only source?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Aggregating data nested into JSON documents

Posted by Tecno Brain <ce...@gmail.com>.
Ok, I found that elephant-bird's JsonLoader cannot handle JSON documents
that are pretty-printed (i.e., expanded over multiple lines). The entire
JSON document has to be on a single line.

After I reformatted some of the source files, I am now getting the expected
output.
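
For anyone else hitting this, here is a minimal preprocessing sketch
(assuming, as above, exactly one JSON document per file; the .oneline.json
suffix is just a placeholder) that collapses each pretty-printed document
onto a single line before uploading it to HDFS:

import json
import sys

# Re-serialize each pretty-printed JSON file onto a single line so that
# line-oriented loaders such as elephant-bird's JsonLoader can read it.
for path in sys.argv[1:]:
    with open(path) as src:
        doc = json.load(src)          # parse the multi-line document
    with open(path + '.oneline.json', 'w') as dst:
        dst.write(json.dumps(doc) + '\n')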




On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain
<ce...@gmail.com>wrote:

> I also tried:
>
> doc = LOAD '/json-pcr/pcr-000001.json' USING
>  com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
> flat = FOREACH doc  GENERATE  (chararray)json#'a' AS first, (long)json#'b'
> AS second ;
> DUMP flat;
>
> but I got no output either.
>
>      Input(s):
>      Successfully read 0 records (35863 bytes) from:
> "/json-pcr/pcr-000001.json"
>
>      Output(s):
>      Successfully stored 0 records in:
> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>
>
>
> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <cerebrotecnologico@gmail.com
> > wrote:
>
>> I got Pig and Hive working ona single-node and I am able to run some
>> script/queries over regular text files (access log files); with a record
>> per line.
>>
>> Now, I want to process some JSON files.
>>
>> As mentioned before, it seems  that ElephantBird would be a would be a
>> good solution to read JSON files.
>>
>> I uploaded 5 files to HDFS. Each file only contain a single JSON
>> document. The documents are NOT in a single line, but rather contain
>> pretty-printed JSON expanding over multiple lines.
>>
>> I'm trying something simple, extracting two (primitive) attributes at the
>> top of the document:
>> {
>>    a : "some value",
>>    ...
>>    b : 133,
>>    ...
>> }
>>
>> So, lets start with a LOAD of a single file (single JSON document):
>>
>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>> doc = LOAD '/json-pcr/pcr-000001.json' using
>>  com.twitter.elephantbird.pig.load.JsonLoader();
>> flat  = FOREACH doc GENERATE (chararray)$0#'a' AS  first, (long)$0#'b' AS
>> second ;
>> DUMP flat;
>>
>> Apparently the job runs without problem, but I get no output. The output
>> I get includes this message:
>>
>>    Input(s):
>>    Successfully read 0 records (35863 bytes) from:
>> "/json-pcr/pcr-000001.json"
>>
>> I was expecting to get
>>
>> ( "some value", 133 )
>>
>> Any idea on what I am doing wrong?
>>
>>
>>
>>
>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com
>> > wrote:
>>
>>> I think you have a misconception of HBase.
>>>
>>> You don't need to actually have mutable data for it to be effective.
>>> The key is that you need to have access to specific records and work a
>>> very small subset of the data and not the complete data set.
>>>
>>>
>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <ce...@gmail.com>
>>> wrote:
>>>
>>> Hi Mike,
>>>
>>> Yes, I also have thought about HBase or Cassandra but my data is pretty
>>> much a snapshot, it does not require updates. Most of my aggregations will
>>> also need to be computed once and won't change over time with the exception
>>> of some aggregation that is based on the last N days of data.  Should I
>>> still consider HBase ? I think that probably it will be good for the
>>> aggregated data.
>>>
>>> I have no idea what are sequence files, but I will take a look.  My raw
>>> data is stored in the cloud, not in my Hadoop cluster.
>>>
>>> I'll keep looking at Pig with ElephantBird.
>>> Thanks,
>>>
>>> -Jorge
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> Hi..
>>>>
>>>> Have you thought about HBase?
>>>>
>>>> I would suggest that if you're using Hive or Pig, to look at taking
>>>> these files and putting the JSON records in to a sequence file.
>>>> Or set of sequence files.... (Then look at HBase to help index them...)
>>>> 200KB is small.
>>>>
>>>> That would be the same for either pig/hive.
>>>>
>>>> In terms of SerDes, I've worked w Pig and ElephantBird, its pretty
>>>> nice. And yes you get each record as a row, however you can always flatten
>>>> them as needed.
>>>>
>>>> Hive?
>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>> Edward Capriolo could give you a better answer.
>>>> Going from memory, I don't know that there is a good SerDe that would
>>>> write JSON, just read it. (Hive)
>>>>
>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>>>> and biased.
>>>>
>>>> I think you're on the right track or at least train of thought.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello,
>>>>    I'm new to Hadoop.
>>>>    I have a large quantity of JSON documents with a structure similar
>>>> to what is shown below.
>>>>
>>>>    {
>>>>      g : "some-group-identifier",
>>>>      sg: "some-subgroup-identifier",
>>>>      j      : "some-job-identifier",
>>>>      page     : 23,
>>>>      ... // other fields omitted
>>>>      important-data : [
>>>>          {
>>>>            f1  : "abc",
>>>>            f2  : "a",
>>>>            f3  : "/"
>>>>            ...
>>>>          },
>>>>          ...
>>>>          {
>>>>            f1 : "xyz",
>>>>            f2  : "q",
>>>>            f3  : "/",
>>>>            ...
>>>>          },
>>>>      ],
>>>>     ... // other fields omitted
>>>>      other-important-data : [
>>>>         {
>>>>            x1  : "ford",
>>>>            x2  : "green",
>>>>            x3  : 35
>>>>            map : {
>>>>                "free-field" : "value",
>>>>                "other-free-field" : value2"
>>>>               }
>>>>          },
>>>>          ...
>>>>          {
>>>>            x1 : "vw",
>>>>            x2  : "red",
>>>>            x3  : 54,
>>>>            ...
>>>>          },
>>>>      ]
>>>>    },
>>>> }
>>>>
>>>>
>>>> Each file contains a single JSON document (gzip compressed, and roughly
>>>> about 200KB uncompressed of pretty-printed json text per document)
>>>>
>>>> I am interested in analyzing only the  "important-data" array and the
>>>> "other-important-data" array.
>>>> My source data would ideally be easier to analyze if it looked like a
>>>> couple of tables with a fixed set of columns. Only the column "map" would
>>>> be a complex column, all others would be primitives.
>>>>
>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>
>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>
>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>> would like to avoid the intermediate step of persisting -and duplicating-
>>>> the "flattened" data.
>>>>
>>>> In order to avoid persisting the data flattened, I thought I had to
>>>> write my own map-reduce in Java code, but discovered that others have had
>>>> the same problem of using JSON as the source and there are somewhat
>>>> "standard" solutions.
>>>>
>>>> By reading about the SerDe approach for Hive I get the impression that
>>>> each JSON document is transformed into a single "row" of the table with
>>>> some columns being an array, a map of other nested structures.
>>>> a) Is there a way to break each JSON document into several "rows" for a
>>>> Hive external table?
>>>> b) It seems there are too many JSON SerDe libraries! Is there any of
>>>> them considered the de-facto standard?
>>>>
>>>> The Pig approach seems also promising using Elephant Bird Do anybody
>>>> has pointers to more user documentation on this project? Or is browsing
>>>> through the examples in GitHub my only source?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: Aggregating data nested into JSON documents

Posted by Tecno Brain <ce...@gmail.com>.
I also tried:

doc = LOAD '/json-pcr/pcr-000001.json' USING
 com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
flat = FOREACH doc  GENERATE  (chararray)json#'a' AS first, (long)json#'b'
AS second ;
DUMP flat;

but I got no output either.

     Input(s):
     Successfully read 0 records (35863 bytes) from:
"/json-pcr/pcr-000001.json"

     Output(s):
     Successfully stored 0 records in:
"hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"



On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain
<ce...@gmail.com>wrote:

> I got Pig and Hive working ona single-node and I am able to run some
> script/queries over regular text files (access log files); with a record
> per line.
>
> Now, I want to process some JSON files.
>
> As mentioned before, it seems  that ElephantBird would be a would be a
> good solution to read JSON files.
>
> I uploaded 5 files to HDFS. Each file only contain a single JSON document.
> The documents are NOT in a single line, but rather contain pretty-printed
> JSON expanding over multiple lines.
>
> I'm trying something simple, extracting two (primitive) attributes at the
> top of the document:
> {
>    a : "some value",
>    ...
>    b : 133,
>    ...
> }
>
> So, lets start with a LOAD of a single file (single JSON document):
>
> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
> doc = LOAD '/json-pcr/pcr-000001.json' using
>  com.twitter.elephantbird.pig.load.JsonLoader();
> flat  = FOREACH doc GENERATE (chararray)$0#'a' AS  first, (long)$0#'b' AS
> second ;
> DUMP flat;
>
> Apparently the job runs without problem, but I get no output. The output I
> get includes this message:
>
>    Input(s):
>    Successfully read 0 records (35863 bytes) from:
> "/json-pcr/pcr-000001.json"
>
> I was expecting to get
>
> ( "some value", 133 )
>
> Any idea on what I am doing wrong?
>
>
>
>
> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>> I think you have a misconception of HBase.
>>
>> You don't need to actually have mutable data for it to be effective.
>> The key is that you need to have access to specific records and work a
>> very small subset of the data and not the complete data set.
>>
>>
>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <ce...@gmail.com>
>> wrote:
>>
>> Hi Mike,
>>
>> Yes, I also have thought about HBase or Cassandra but my data is pretty
>> much a snapshot, it does not require updates. Most of my aggregations will
>> also need to be computed once and won't change over time with the exception
>> of some aggregation that is based on the last N days of data.  Should I
>> still consider HBase ? I think that probably it will be good for the
>> aggregated data.
>>
>> I have no idea what are sequence files, but I will take a look.  My raw
>> data is stored in the cloud, not in my Hadoop cluster.
>>
>> I'll keep looking at Pig with ElephantBird.
>> Thanks,
>>
>> -Jorge
>>
>>
>>
>>
>>
>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com
>> > wrote:
>>
>>> Hi..
>>>
>>> Have you thought about HBase?
>>>
>>> I would suggest that if you're using Hive or Pig, to look at taking
>>> these files and putting the JSON records in to a sequence file.
>>> Or set of sequence files.... (Then look at HBase to help index them...)
>>> 200KB is small.
>>>
>>> That would be the same for either pig/hive.
>>>
>>> In terms of SerDes, I've worked w Pig and ElephantBird, its pretty nice.
>>> And yes you get each record as a row, however you can always flatten them
>>> as needed.
>>>
>>> Hive?
>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>>> Capriolo could give you a better answer.
>>> Going from memory, I don't know that there is a good SerDe that would
>>> write JSON, just read it. (Hive)
>>>
>>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>>> and biased.
>>>
>>> I think you're on the right track or at least train of thought.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>    I'm new to Hadoop.
>>>    I have a large quantity of JSON documents with a structure similar to
>>> what is shown below.
>>>
>>>    {
>>>      g : "some-group-identifier",
>>>      sg: "some-subgroup-identifier",
>>>      j      : "some-job-identifier",
>>>      page     : 23,
>>>      ... // other fields omitted
>>>      important-data : [
>>>          {
>>>            f1  : "abc",
>>>            f2  : "a",
>>>            f3  : "/"
>>>            ...
>>>          },
>>>          ...
>>>          {
>>>            f1 : "xyz",
>>>            f2  : "q",
>>>            f3  : "/",
>>>            ...
>>>          },
>>>      ],
>>>     ... // other fields omitted
>>>      other-important-data : [
>>>         {
>>>            x1  : "ford",
>>>            x2  : "green",
>>>            x3  : 35
>>>            map : {
>>>                "free-field" : "value",
>>>                "other-free-field" : value2"
>>>               }
>>>          },
>>>          ...
>>>          {
>>>            x1 : "vw",
>>>            x2  : "red",
>>>            x3  : 54,
>>>            ...
>>>          },
>>>      ]
>>>    },
>>> }
>>>
>>>
>>> Each file contains a single JSON document (gzip compressed, and roughly
>>> about 200KB uncompressed of pretty-printed json text per document)
>>>
>>> I am interested in analyzing only the  "important-data" array and the
>>> "other-important-data" array.
>>> My source data would ideally be easier to analyze if it looked like a
>>> couple of tables with a fixed set of columns. Only the column "map" would
>>> be a complex column, all others would be primitives.
>>>
>>> ( g, sg, j, page, f1, f2, f3 )
>>>
>>> ( g, sg, j, page, x1, x2, x3, map )
>>>
>>> So, for each JSON document, I would like to "create" several rows, but I
>>> would like to avoid the intermediate step of persisting -and duplicating-
>>> the "flattened" data.
>>>
>>> In order to avoid persisting the data flattened, I thought I had to
>>> write my own map-reduce in Java code, but discovered that others have had
>>> the same problem of using JSON as the source and there are somewhat
>>> "standard" solutions.
>>>
>>> By reading about the SerDe approach for Hive I get the impression that
>>> each JSON document is transformed into a single "row" of the table with
>>> some columns being an array, a map of other nested structures.
>>> a) Is there a way to break each JSON document into several "rows" for a
>>> Hive external table?
>>> b) It seems there are too many JSON SerDe libraries! Is there any of
>>> them considered the de-facto standard?
>>>
>>> The Pig approach seems also promising using Elephant Bird Do anybody has
>>> pointers to more user documentation on this project? Or is browsing through
>>> the examples in GitHub my only source?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>

Re: Aggregating data nested into JSON documents

Posted by Tecno Brain <ce...@gmail.com>.
I got Pig and Hive working on a single node and I am able to run some
scripts/queries over regular text files (access log files), with one record
per line.

Now, I want to process some JSON files.

As mentioned before, it seems that ElephantBird would be a good
solution for reading JSON files.

I uploaded 5 files to HDFS. Each file contains only a single JSON document.
The documents are NOT on a single line, but rather contain pretty-printed
JSON spanning multiple lines.

I'm trying something simple, extracting two (primitive) attributes at the
top of the document:
{
   a : "some value",
   ...
   b : 133,
   ...
}

So, let's start with a LOAD of a single file (a single JSON document):

REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
doc  = LOAD '/json-pcr/pcr-000001.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader();
flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
DUMP flat;

Apparently the job runs without problems, but I get no output. The output I
get includes this message:

   Input(s):
   Successfully read 0 records (35863 bytes) from:
"/json-pcr/pcr-000001.json"

I was expecting to get

( "some value", 133 )

Any idea on what I am doing wrong?
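
One sanity check I can think of (just a sketch, using only Pig built-ins) is to
load the same file with TextLoader and count how many line records Pig sees; a
pretty-printed document shows up as many short text lines rather than as one
JSON record:

raw = LOAD '/json-pcr/pcr-000001.json' USING TextLoader() AS (line:chararray);
grp = GROUP raw ALL;                    -- one group containing every line
cnt = FOREACH grp GENERATE COUNT(raw);  -- number of text lines in the file
DUMP cnt;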




On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <mi...@hotmail.com>wrote:

> I think you have a misconception of HBase.
>
> You don't need to actually have mutable data for it to be effective.
> The key is that you need to have access to specific records and work a
> very small subset of the data and not the complete data set.
>
>
> On Jun 13, 2013, at 11:59 AM, Tecno Brain <ce...@gmail.com>
> wrote:
>
> Hi Mike,
>
> Yes, I also have thought about HBase or Cassandra but my data is pretty
> much a snapshot, it does not require updates. Most of my aggregations will
> also need to be computed once and won't change over time with the exception
> of some aggregation that is based on the last N days of data.  Should I
> still consider HBase ? I think that probably it will be good for the
> aggregated data.
>
> I have no idea what are sequence files, but I will take a look.  My raw
> data is stored in the cloud, not in my Hadoop cluster.
>
> I'll keep looking at Pig with ElephantBird.
> Thanks,
>
> -Jorge
>
>
>
>
>
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>> Hi..
>>
>> Have you thought about HBase?
>>
>> I would suggest that if you're using Hive or Pig, to look at taking these
>> files and putting the JSON records in to a sequence file.
>> Or set of sequence files.... (Then look at HBase to help index them...)
>> 200KB is small.
>>
>> That would be the same for either pig/hive.
>>
>> In terms of SerDes, I've worked w Pig and ElephantBird, its pretty nice.
>> And yes you get each record as a row, however you can always flatten them
>> as needed.
>>
>> Hive?
>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>> Capriolo could give you a better answer.
>> Going from memory, I don't know that there is a good SerDe that would
>> write JSON, just read it. (Hive)
>>
>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>> and biased.
>>
>> I think you're on the right track or at least train of thought.
>>
>> HTH
>>
>> -Mike
>>
>>
>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com>
>> wrote:
>>
>> Hello,
>>    I'm new to Hadoop.
>>    I have a large quantity of JSON documents with a structure similar to
>> what is shown below.
>>
>>    {
>>      g : "some-group-identifier",
>>      sg: "some-subgroup-identifier",
>>      j      : "some-job-identifier",
>>      page     : 23,
>>      ... // other fields omitted
>>      important-data : [
>>          {
>>            f1  : "abc",
>>            f2  : "a",
>>            f3  : "/"
>>            ...
>>          },
>>          ...
>>          {
>>            f1 : "xyz",
>>            f2  : "q",
>>            f3  : "/",
>>            ...
>>          },
>>      ],
>>     ... // other fields omitted
>>      other-important-data : [
>>         {
>>            x1  : "ford",
>>            x2  : "green",
>>            x3  : 35
>>            map : {
>>                "free-field" : "value",
>>                "other-free-field" : value2"
>>               }
>>          },
>>          ...
>>          {
>>            x1 : "vw",
>>            x2  : "red",
>>            x3  : 54,
>>            ...
>>          },
>>      ]
>>    },
>> }
>>
>>
>> Each file contains a single JSON document (gzip compressed, and roughly
>> about 200KB uncompressed of pretty-printed json text per document)
>>
>> I am interested in analyzing only the  "important-data" array and the
>> "other-important-data" array.
>> My source data would ideally be easier to analyze if it looked like a
>> couple of tables with a fixed set of columns. Only the column "map" would
>> be a complex column, all others would be primitives.
>>
>> ( g, sg, j, page, f1, f2, f3 )
>>
>> ( g, sg, j, page, x1, x2, x3, map )
>>
>> So, for each JSON document, I would like to "create" several rows, but I
>> would like to avoid the intermediate step of persisting -and duplicating-
>> the "flattened" data.
>>
>> In order to avoid persisting the data flattened, I thought I had to write
>> my own map-reduce in Java code, but discovered that others have had the
>> same problem of using JSON as the source and there are somewhat "standard"
>> solutions.
>>
>> By reading about the SerDe approach for Hive I get the impression that
>> each JSON document is transformed into a single "row" of the table with
>> some columns being an array, a map of other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a
>> Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Is there any of them
>> considered the de-facto standard?
>>
>> The Pig approach seems also promising using Elephant Bird Do anybody has
>> pointers to more user documentation on this project? Or is browsing through
>> the examples in GitHub my only source?
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Re: Aggregating data nested into JSON documents

Posted by Michael Segel <mi...@hotmail.com>.
I think you have a misconception about HBase.

You don't actually need mutable data for it to be effective.
The key is that you need access to specific records and to work with a very small subset of the data, not the complete data set.
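
As a rough illustration only (the table name 'pcr_agg', column family 'd', and
row key below are all made up), Pig can read a narrow slice out of HBase instead
of scanning whole files:

agg = LOAD 'hbase://pcr_agg'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:f1 d:f2', '-loadKey true')
      AS (rowkey:chararray, f1:chararray, f2:chararray);
few = FILTER agg BY rowkey == 'some-group-identifier';  -- touch a small subset, not the whole data set
DUMP few;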


On Jun 13, 2013, at 11:59 AM, Tecno Brain <ce...@gmail.com> wrote:

> Hi Mike,
> 
> Yes, I also have thought about HBase or Cassandra but my data is pretty much a snapshot, it does not require updates. Most of my aggregations will also need to be computed once and won't change over time with the exception of some aggregation that is based on the last N days of data.  Should I still consider HBase ? I think that probably it will be good for the aggregated data. 
> 
> I have no idea what are sequence files, but I will take a look.  My raw data is stored in the cloud, not in my Hadoop cluster. 
> 
> I'll keep looking at Pig with ElephantBird. 
> Thanks,
> 
> -Jorge 
> 
> 
> 
> 
> 
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <mi...@hotmail.com> wrote:
> Hi..
> 
> Have you thought about HBase? 
> 
> I would suggest that if you're using Hive or Pig, to look at taking these files and putting the JSON records in to a sequence file. 
> Or set of sequence files.... (Then look at HBase to help index them...) 200KB is small. 
> 
> That would be the same for either pig/hive.
> 
> In terms of SerDes, I've worked w Pig and ElephantBird, its pretty nice. And yes you get each record as a row, however you can always flatten them as needed. 
> 
> Hive? 
> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward Capriolo could give you a better answer. 
> Going from memory, I don't know that there is a good SerDe that would write JSON, just read it. (Hive)
> 
> IMHO Pig/ElephantBird is the best so far, but then again I may be dated and biased. 
> 
> I think you're on the right track or at least train of thought. 
> 
> HTH
> 
> -Mike
> 
> 
> On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com> wrote:
> 
>> Hello, 
>>    I'm new to Hadoop. 
>>    I have a large quantity of JSON documents with a structure similar to what is shown below.  
>> 
>>    {
>>      g : "some-group-identifier",
>>      sg: "some-subgroup-identifier",
>>      j      : "some-job-identifier",
>>      page     : 23,
>>      ... // other fields omitted
>>      important-data : [
>>          {
>>            f1  : "abc",
>>            f2  : "a",
>>            f3  : "/"
>>            ...
>>          },
>>          ...
>>          {
>>            f1 : "xyz",
>>            f2  : "q",
>>            f3  : "/",
>>            ... 
>>          },
>>      ],
>>     ... // other fields omitted 
>>      other-important-data : [
>>         {
>>            x1  : "ford",
>>            x2  : "green",
>>            x3  : 35
>>            map : {
>>                "free-field" : "value",
>>                "other-free-field" : value2"
>>               }
>>          },
>>          ...
>>          {
>>            x1 : "vw",
>>            x2  : "red",
>>            x3  : 54,
>>            ... 
>>          },    
>>      ]
>>    },
>> }
>>  
>> 
>> Each file contains a single JSON document (gzip compressed, and roughly about 200KB uncompressed of pretty-printed json text per document)
>> 
>> I am interested in analyzing only the  "important-data" array and the "other-important-data" array.
>> My source data would ideally be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the column "map" would be a complex column, all others would be primitives.
>> 
>> ( g, sg, j, page, f1, f2, f3 )
>>  
>> ( g, sg, j, page, x1, x2, x3, map )
>> 
>> So, for each JSON document, I would like to "create" several rows, but I would like to avoid the intermediate step of persisting -and duplicating- the "flattened" data.
>> 
>> In order to avoid persisting the data flattened, I thought I had to write my own map-reduce in Java code, but discovered that others have had the same problem of using JSON as the source and there are somewhat "standard" solutions. 
>> 
>> By reading about the SerDe approach for Hive I get the impression that each JSON document is transformed into a single "row" of the table with some columns being an array, a map of other nested structures. 
>> a) Is there a way to break each JSON document into several "rows" for a Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Is there any of them considered the de-facto standard? 
>> 
>> The Pig approach seems also promising using Elephant Bird Do anybody has pointers to more user documentation on this project? Or is browsing through the examples in GitHub my only source?
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 


Re: Aggregating data nested into JSON documents

Posted by Tecno Brain <ce...@gmail.com>.
Hi Mike,

Yes, I have also thought about HBase or Cassandra, but my data is pretty
much a snapshot; it does not require updates. Most of my aggregations will
also only need to be computed once and won't change over time, except for
some aggregations based on the last N days of data. Should I still
consider HBase? I think it will probably be a good fit for the aggregated
data.

I have no idea what sequence files are, but I will take a look. My raw
data is stored in the cloud, not in my Hadoop cluster.

I'll keep looking at Pig with ElephantBird.
Thanks,

-Jorge





On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <mi...@hotmail.com>wrote:

> Hi..
>
> Have you thought about HBase?
>
> I would suggest that if you're using Hive or Pig, to look at taking these
> files and putting the JSON records in to a sequence file.
> Or set of sequence files.... (Then look at HBase to help index them...)
> 200KB is small.
>
> That would be the same for either pig/hive.
>
> In terms of SerDes, I've worked w Pig and ElephantBird, its pretty nice.
> And yes you get each record as a row, however you can always flatten them
> as needed.
>
> Hive?
> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
> Capriolo could give you a better answer.
> Going from memory, I don't know that there is a good SerDe that would
> write JSON, just read it. (Hive)
>
> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
> and biased.
>
> I think you're on the right track or at least train of thought.
>
> HTH
>
> -Mike
>
>
> On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com>
> wrote:
>
> Hello,
>    I'm new to Hadoop.
>    I have a large quantity of JSON documents with a structure similar to
> what is shown below.
>
>    {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/"
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ...
>          },
>      ],
>     ... // other fields omitted
>      other-important-data : [
>         {
>            x1  : "ford",
>            x2  : "green",
>            x3  : 35
>            map : {
>                "free-field" : "value",
>                "other-free-field" : value2"
>               }
>          },
>          ...
>          {
>            x1 : "vw",
>            x2  : "red",
>            x3  : 54,
>            ...
>          },
>      ]
>    },
> }
>
>
> Each file contains a single JSON document (gzip compressed, and roughly
> about 200KB uncompressed of pretty-printed json text per document)
>
> I am interested in analyzing only the  "important-data" array and the
> "other-important-data" array.
> My source data would ideally be easier to analyze if it looked like a
> couple of tables with a fixed set of columns. Only the column "map" would
> be a complex column, all others would be primitives.
>
> ( g, sg, j, page, f1, f2, f3 )
>
> ( g, sg, j, page, x1, x2, x3, map )
>
> So, for each JSON document, I would like to "create" several rows, but I
> would like to avoid the intermediate step of persisting -and duplicating-
> the "flattened" data.
>
> In order to avoid persisting the data flattened, I thought I had to write
> my own map-reduce in Java code, but discovered that others have had the
> same problem of using JSON as the source and there are somewhat "standard"
> solutions.
>
> By reading about the SerDe approach for Hive I get the impression that
> each JSON document is transformed into a single "row" of the table with
> some columns being an array, a map of other nested structures.
> a) Is there a way to break each JSON document into several "rows" for a
> Hive external table?
> b) It seems there are too many JSON SerDe libraries! Is there any of them
> considered the de-facto standard?
>
> The Pig approach seems also promising using Elephant Bird Do anybody has
> pointers to more user documentation on this project? Or is browsing through
> the examples in GitHub my only source?
>
> Thanks
>
>
>
>
>
>
>
>
>
>
>
>

Re: Aggregating data nested into JSON documents

Posted by Michael Segel <mi...@hotmail.com>.
Hi..

Have you thought about HBase? 

If you're using Hive or Pig, I would suggest taking these files and putting the JSON records into a sequence file, or a set of sequence files. (Then look at HBase to help index them.) 200KB per document is small. 
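
Something along these lines, as a rough sketch of the packing step (the class name, the output path argument, and using the file name as the key are just placeholder choices for illustration):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackJsonDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);  // e.g. /data/json-docs.seq (placeholder)
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // One key/value pair per small .gz file: key = file name, value = the JSON text.
      for (int i = 1; i < args.length; i++) {
        File gz = new File(args[i]);
        StringBuilder json = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(gz)), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
          json.append(line).append('\n');
        }
        in.close();
        writer.append(new Text(gz.getName()), new Text(json.toString()));
      }
    } finally {
      writer.close();
    }
  }
}

Packing many small gzipped documents into a few sequence files avoids the small-files problem on HDFS and gives Pig or Hive larger, splittable inputs to read.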

That approach works the same for either Pig or Hive.

In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice. And yes, you get each record as a row, but you can always flatten it as needed. 

Hive? 
I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward Capriolo could give you a better answer. 
Going from memory, I don't know of a good Hive SerDe that can write JSON; the ones I know of only read it.

IMHO Pig/ElephantBird is the best so far, but then again I may be dated and biased. 

I think you're on the right track or at least train of thought. 

HTH

-Mike


On Jun 12, 2013, at 7:57 PM, Tecno Brain <ce...@gmail.com> wrote:

> Hello, 
>    I'm new to Hadoop. 
>    I have a large quantity of JSON documents with a structure similar to what is shown below.  
> 
>    {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/"
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ... 
>          },
>      ],
>     ... // other fields omitted 
>      other-important-data : [
>         {
>            x1  : "ford",
>            x2  : "green",
>            x3  : 35
>            map : {
>                "free-field" : "value",
>                "other-free-field" : value2"
>               }
>          },
>          ...
>          {
>            x1 : "vw",
>            x2  : "red",
>            x3  : 54,
>            ... 
>          },    
>      ]
>    },
> }
>  
> 
> Each file contains a single JSON document (gzip compressed, and roughly about 200KB uncompressed of pretty-printed json text per document)
> 
> I am interested in analyzing only the  "important-data" array and the "other-important-data" array.
> My source data would ideally be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the column "map" would be a complex column, all others would be primitives.
> 
> ( g, sg, j, page, f1, f2, f3 )
>  
> ( g, sg, j, page, x1, x2, x3, map )
> 
> So, for each JSON document, I would like to "create" several rows, but I would like to avoid the intermediate step of persisting -and duplicating- the "flattened" data.
> 
> In order to avoid persisting the data flattened, I thought I had to write my own map-reduce in Java code, but discovered that others have had the same problem of using JSON as the source and there are somewhat "standard" solutions. 
> 
> By reading about the SerDe approach for Hive I get the impression that each JSON document is transformed into a single "row" of the table with some columns being an array, a map of other nested structures. 
> a) Is there a way to break each JSON document into several "rows" for a Hive external table?
> b) It seems there are too many JSON SerDe libraries! Is there any of them considered the de-facto standard? 
> 
> The Pig approach seems also promising using Elephant Bird Do anybody has pointers to more user documentation on this project? Or is browsing through the examples in GitHub my only source?
> 
> Thanks
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 

