Posted to dev@pig.apache.org by "Chao Tian (JIRA)" <ji...@apache.org> on 2011/03/17 08:30:29 UTC

[jira] Created: (PIG-1914) Support load/store JSON data in Pig

Support load/store JSON data in Pig
-----------------------------------

                 Key: PIG-1914
                 URL: https://issues.apache.org/jira/browse/PIG-1914
             Project: Pig
          Issue Type: New Feature
    Affects Versions: 0.8.0
            Reporter: Chao Tian
             Fix For: 0.9.0


JSON is a commonly used data storage format. It is popular for storing structured data, especially for JavaScript data exchange.
Pig should be able to load and store JSON data. I plan to write a loader and storer for the piggybank.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Chao Tian (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008312#comment-13008312 ] 

Chao Tian commented on PIG-1914:
--------------------------------

Hi Dmitry,

Thanks for your comment. It is good to see that there is already a JSON loader. I read the code and found that the current solution parses the input JSON data into a map object.

However, in my design, I plan to support JSON-to-Tuple conversion. Each key of a JSON object would be loaded as a field alias in the Tuple, and each value would be loaded as the data in that field. Simple data types can be converted easily. For complex types, a JSON object would be mapped to a Pig Tuple and a JSON array to a Pig DataBag.
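
A minimal sketch of the mapping described above, not the code from any patch on this issue. It assumes the JSON text has already been parsed into java.util.Map and java.util.List values (for example by a JSON library), and it uses plain lists as stand-ins for Pig's Tuple and DataBag. The class and method names are hypothetical, and wrapping scalar array elements in one-field tuples is an assumption made because a Pig bag holds tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-ins for Pig types (assumption): a "tuple" is a List<Object> of field
// values and a "bag" is a List of tuples. Real loader code would build Pig's
// Tuple and DataBag instead.
class JsonToTupleSketch {

    /** Converts an already-parsed JSON value into its tuple/bag/scalar form. */
    @SuppressWarnings("unchecked")
    static Object convert(Object json) {
        if (json instanceof Map) {
            // JSON object -> tuple: one field per key, in key iteration order.
            List<Object> tuple = new ArrayList<Object>();
            for (Object value : ((Map<?, ?>) json).values()) {
                tuple.add(convert(value));
            }
            return tuple;
        }
        if (json instanceof List) {
            // JSON array -> bag: every element ends up as a tuple in the bag.
            List<List<Object>> bag = new ArrayList<List<Object>>();
            for (Object element : (List<?>) json) {
                Object converted = convert(element);
                if (element instanceof Map) {
                    bag.add((List<Object>) converted);
                } else {
                    // Scalars and nested arrays are wrapped in a one-field tuple.
                    List<Object> wrapper = new ArrayList<Object>();
                    wrapper.add(converted);
                    bag.add(wrapper);
                }
            }
            return bag;
        }
        // Simple types (string, number, boolean, null) pass through unchanged.
        return json;
    }

    /** The top-level field aliases are the keys of the JSON object. */
    static List<String> aliases(Map<String, Object> jsonObject) {
        return new ArrayList<String>(jsonObject.keySet());
    }
}
```

For an object with keys id = 1 and tags = ["a", "b"], convert returns the tuple [1, [[a], [b]]] and aliases returns [id, tags].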

And I also plan to write a storer to store data in JSON format.

Any thoughts?

Thanks,
Chao



[jira] [Updated] (PIG-1914) Support load/store JSON data in Pig

Posted by "Michael May (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael May updated PIG-1914:
-----------------------------

                 Tags: JSON LoadFunc
    Affects Version/s: 0.9.0
         Release Note: Adds support for loading JSON data in Pig
               Status: Patch Available  (was: Open)


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021965#comment-13021965 ] 

Bill Graham commented on PIG-1914:
----------------------------------

+1 for a Map solution that allows unknown JSON keys/values to be handled. We often run jobs that create summaries of counts of all JSON keys, many of which are either unknown or not reliably implied by reading a random row.

If instead a JSON loader is contributed that returns Tuples from either a predefined schema or via introspection, I suggest it be named in a way that implies this. Multiple implementations can be supported.


[jira] [Updated] (PIG-1914) Support load/store JSON data in Pig

Posted by "Michael May (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael May updated PIG-1914:
-----------------------------

    Attachment: PIG-1914.patch


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065961#comment-13065961 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

Very cool.

Some quick code review notes:

Tiny typo here:
"e = foreach d generate flatten(men#'value') as val;" -- that should read menu#'value'


{code}
boolean notDone = in.nextKeyValue();
if (!notDone) {
    return null;
}
{code}

Better: {code}
if (!in.nextKeyValue()) {
    return null;
}
{code}

Parse exceptions: it's better to increment a counter and move on than to break on a bad input string. Throwing an exception kills the whole job. So maybe something like 
{code}
t = null;
while (t == null && in.nextKeyValue()) {
 ...
}
return t;
{code}
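
To make the "count and move on" idea concrete, here is a self-contained sketch that does not depend on Pig or Hadoop. The SkippingReader class, its Parser interface, and the plain long counter are all hypothetical stand-ins: in a real loader the input would come from the record reader and the count would presumably be reported through a job counter.

```java
import java.util.Iterator;

// Hypothetical stand-in for a loader's getNext() loop: bad records are
// counted and skipped instead of failing the whole job.
class SkippingReader<T> {

    /** Hypothetical parser hook; may throw on malformed input. */
    interface Parser<R> {
        R parse(String line) throws Exception;
    }

    private final Iterator<String> in;
    private final Parser<T> parser;
    private long badRecords = 0;

    SkippingReader(Iterator<String> in, Parser<T> parser) {
        this.in = in;
        this.parser = parser;
    }

    /** Returns the next parsable record, or null once the input is exhausted. */
    T getNext() {
        T t = null;
        while (t == null && in.hasNext()) {
            String line = in.next();
            try {
                t = parser.parse(line);
            } catch (Exception e) {
                // Count the bad record and keep reading.
                badRecords++;
            }
        }
        return t;
    }

    long getBadRecords() {
        return badRecords;
    }
}
```

One consequence of this loop shape is that a parser returning null for a line is treated the same as a skipped record, so null cannot be used as a legitimate parsed value.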

In flatten_array, if the value is an array, you allocate a new bag, populate it recursively, and add the contents of the new bag to the old bag. Why not skip the object allocation and copy, and simply pass the original bag into the recursive call?
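
The suggestion can be sketched as follows. This is not the patch's flatten_array, whose exact behavior is not shown in this thread; it is a hypothetical illustration that uses a plain List as a stand-in for the bag and assumes nested arrays are flattened depth first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: the recursive call appends straight into the
// caller's bag, so no intermediate bag is allocated and copied.
class FlattenArraySketch {

    /** Appends every non-list leaf of json, depth first, to out. */
    static void flattenInto(Object json, List<Object> out) {
        if (json instanceof List) {
            for (Object element : (List<?>) json) {
                flattenInto(element, out); // reuse the same bag on recursion
            }
        } else {
            out.add(json);
        }
    }

    /** Convenience wrapper that allocates the single output bag. */
    static List<Object> flatten(Object json) {
        List<Object> bag = new ArrayList<Object>();
        flattenInto(json, bag);
        return bag;
    }
}
```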

Also: are null values for keys just plain unsupported? You skip them.

setLocation: not that it really matters, but for consistency, you should use PigTextInputFormat instead of PigFileInputFormat here.

schema: probably makes sense to implement getSchema?


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Michael May (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058628#comment-13058628 ] 

Michael May commented on PIG-1914:
----------------------------------

I'm getting close to being ready to post a patch for a loader, but have a question (pardon me if this is not the right place to ask it, but this thread seems like a reasonable place).
 
The JSON Parser I'm currently using is part of an external dependency (namely, json-simple). I'm /assuming/ it's ok to add this dependency into the project. I'm familiar with maven's way of handling dependencies, but not so much with ant's. After doing a little digging around I found /ivy/pig.pom which looks similar to the dependency section of a maven pom.xml file. Can I add the dependency in here, or is there some other location where I can specify this dependency? 

Also, (somewhat unrelated, but a noob question) I'm currently working with this feature off of trunk. Is that where I should be working? The specified 'affected version' is 0.8.0 and I see there are 0.8 branches and 0.9 branches. Just want to make sure I'm working in the right place.

Thanks

     


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058646#comment-13058646 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

no, storage is the right place, I just meant don't put it into Pig builtins.


[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008296#comment-13008296 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

For Pig 0.6: https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load
For Pig 0.8: https://github.com/dvryaboy/elephant-bird/tree/pig-08/src/java/com/twitter/elephantbird/pig8/load

A Pig 0.9 version might be interesting because in this version, Pig understands typed keys, so it's finally possible to return complex structures as values, actually delivering the whole Json object.

If you want to add directly to Pig, you'll probably want to use Jackson for parsing instead of SimpleJson, as that library is already included in Pig dependencies (and maybe even Hadoop ones?).


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051450#comment-13051450 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

All you gotta do is post a patch.. though that particular gist is a little encumbered since it's had so many authors and all of them would have to sign off on the fact that they are cool with an apache license.


[jira] [Updated] (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1914:
-----------------------------------

    Status: Open  (was: Patch Available)

canceling patch status, pending review response.

please note that in the meantime, JsonStorage/Loader were added to Pig, but they are bound to a strict schema and the loader essentially only works on JSON stored by JsonStorage, not any JSON. So we probably still need an alternative loader.

Also note that EB is now much more modular (so, fewer dependencies required if you do not need them), and the json storage module there allows deep parsing (tuples, maps, the works). It does not sample any records to auto-determine schema, and still returns a map.

-D
                

[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Chao Tian (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009003#comment-13009003 ] 

Chao Tian commented on PIG-1914:
--------------------------------

Thanks Ed. I am working on the loader right now. I have finished a json.org version, and I am trying to rewrite it using the Jackson streaming API to parse JSON from a byte stream.


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Ed Summers (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009000#comment-13009000 ] 

Ed Summers commented on PIG-1914:
---------------------------------

+1 for a JSON Loader/Storer that are part of PiggyBank. elephant-bird is nice, but elephantbird needs to a) be discovered, and b) built ... which is non-trivial given the various dependencies. elephant-bird also seems to only be compatible with Pig v0.6. 


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Michael May (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058639#comment-13058639 ] 

Michael May commented on PIG-1914:
----------------------------------

I didn't realize there was already a dependency for doing json parsing.  That is good news! I'll work with it.

Currently I have this in contrib/piggybank/storage. If I need to move it up one directory level, then that is no problem.


[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007956#comment-13007956 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

There is already a JSON loader in Elephant-Bird.


[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Chao Tian (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008292#comment-13008292 ] 

Chao Tian commented on PIG-1914:
--------------------------------

Hi Dmitry, could you share a link to the JSON loader you talked about?


[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008322#comment-13008322 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

That is a good idea, it would be quite useful for a number of scenarios.

One problem with this design is that JSON objects often do not have a consistent set of keys, and each of the json objects you read may in fact have a totally new set of keys. How do you suggest dealing with something like that? 


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058633#comment-13058633 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

Michael,
I would strongly encourage you to use Jackson instead. It's already a dependency, and a lot of folks are starting to complain about the weight of the pig jar.

Trunk's the right place to add new features. Initially this should go into contrib/piggybank until it proves stable.


[jira] Updated: (PIG-1914) Support load/store JSON data in Pig

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1914:
--------------------------------

    Fix Version/s:     (was: 0.9.0)

Unlinking from the release.

Please check the one Dmitry suggested. Also, to get this into the release, you need to find somebody who would commit to doing the work on this soon. We are going to start stabilizing 0.9 in a week or so.


[jira] [Commented] (PIG-1914) Support load/store JSON data in Pig

Posted by "Michael May (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051153#comment-13051153 ] 

Michael May commented on PIG-1914:
----------------------------------

This issue is from several months ago. Any word on progress? I haven't seen any JSON stuff pop up in PiggyBank.

I've been using the JSON loader as seen here: https://gist.github.com/601331
Note this is only for loading, not storing!

I realize this is only half of the requested JSON features (load, not store), but I think having a JSON loader is better than the JSON nothing that is currently in PiggyBank. I know I was very sad when I noticed that PiggyBank contains a CSV Loader and an XML Loader, but no JSON Loader.

I'd be more than happy to get this loader rolled into the PiggyBank SVN with approval.




[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Chao Tian (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008330#comment-13008330 ] 

Chao Tian commented on PIG-1914:
--------------------------------

Yeah, I agree that we have this problem. However, I think we should assume that the JSON records in the same data file have a similar schema. Small differences could be allowed, but they should be similar, right?

To deal with these small differences, we could define the schema for the loaded tuple using the complete set of keys. I plan to have two methods of obtaining the schema: 1) the user passes a schema string that describes the loaded data; 2) if the user passes nothing, the loader parses the first line of the input data to get the schema. Either way, the loaded data would have a schema. This schema should be the complete set of keys. If some JSON records do not contain some fields, those fields would be left as null in Pig.

I think this method could solve our problem. It would also let us support columnar filtering in the future, which means loading only the desired columns of the JSON data.
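
A minimal sketch of the second method described above (schema taken from the first record, missing fields left as null), under stated assumptions: records are already parsed into maps, a plain List stands in for a Pig Tuple, and the class and method names are hypothetical. Keys that are not in the schema are silently dropped here, which is one possible policy and not something the design above specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a fixed schema applied to every record.
class SchemaProjectionSketch {

    /** Takes the schema from the first record: its keys, in iteration order. */
    static List<String> inferSchema(Map<String, Object> firstRecord) {
        return new ArrayList<String>(firstRecord.keySet());
    }

    /**
     * Projects a record onto the schema. Fields missing from the record
     * become null; keys outside the schema are ignored.
     */
    static List<Object> toTuple(List<String> schema, Map<String, Object> record) {
        List<Object> tuple = new ArrayList<Object>(schema.size());
        for (String field : schema) {
            tuple.add(record.get(field)); // Map.get returns null when absent
        }
        return tuple;
    }
}
```

Because toTuple only reads the fields named in the schema, passing a shorter schema gives a simple form of the column filtering mentioned above.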



[jira] Commented: (PIG-1914) Support load/store JSON data in Pig

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008335#comment-13008335 ] 

Dmitriy V. Ryaboy commented on PIG-1914:
----------------------------------------

That design makes sense, but the assumption that the first few records you read are going to have the full set of keys often does not hold true in my experience. It's probably very useful for a large subset of json-loading needs out there, though. Sounds like a good approach.
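
The caveat can be illustrated with a small sketch that infers the schema as the union of keys over a sample of records. This is a hypothetical helper, not part of any loader discussed here; it assumes records are already parsed into maps and that the sample size is chosen by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: a schema inferred from a sample only contains
// the keys that happen to appear in that sample.
class SampledSchemaSketch {

    /** Union of keys over the first sampleSize records, in first-seen order. */
    static List<String> inferSchema(List<Map<String, Object>> records, int sampleSize) {
        Set<String> keys = new LinkedHashSet<String>();
        int limit = Math.min(sampleSize, records.size());
        for (int i = 0; i < limit; i++) {
            keys.addAll(records.get(i).keySet());
        }
        return new ArrayList<String>(keys);
    }
}
```

With three records whose key sets are {a}, {a, b}, and {c}, a sample of one yields [a] and a sample of two yields [a, b]; the key c is only seen if all three records are sampled.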

