Posted to dev@avro.apache.org by "Ron Bodkin (JIRA)" <ji...@apache.org> on 2010/09/21 01:05:33 UTC

[jira] Created: (AVRO-672) Convert JSON Text Input to Avro Tool

Convert JSON Text Input to Avro Tool
------------------------------------

                 Key: AVRO-672
                 URL: https://issues.apache.org/jira/browse/AVRO-672
             Project: Avro
          Issue Type: New Feature
            Reporter: Ron Bodkin


The attached patch reads a JSON-formatted text file and converts it to a conforming Avro text file, emitting one record per line. For example, it can read this input file:
{"intval":12}
{"intval":-73,"strval":"hello, there!!"}

with this schema:
{ "type":"record", "name":"TestRecord", "fields": [ {"name":"intval","type":"int"}, {"name":"strval","type":["string", "null"]}]}

returning valid Avro. This is different from the DataFileWriteTool, which would read in the following internal encoding:
{"intval":12,"strval":null}
{"intval":-73,"strval":{"string":"hello, there!!"}}

In general, the internal encodings used by Avro aren't natural when reading in JSON text that appears in the wild. Likewise, this utility allows changing invalid Avro identifier characters into an underscore, again to tolerate JSON that wasn't designed to be readable by Avro.
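
To make the difference between the two encodings concrete, here is a minimal Python sketch (not the attached patch, which is not shown in this thread) that rewrites the plain JSON records above into the internal encoding expected by DataFileWriteTool. The helper names matches, to_avro_json and convert_record are hypothetical, only the int, string and null types from the example schema are handled, and it assumes a missing field should be read as null.

```python
import json

# Schema copied from the issue description.
SCHEMA = json.loads("""
{ "type":"record", "name":"TestRecord", "fields": [
  {"name":"intval","type":"int"},
  {"name":"strval","type":["string", "null"]}]}
""")


def matches(value, avro_type):
    """Hypothetical helper: does a plain JSON value fit a primitive Avro type?"""
    if avro_type == "null":
        return value is None
    if avro_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if avro_type == "string":
        return isinstance(value, str)
    return False  # other types are outside the scope of this sketch


def to_avro_json(value, avro_type):
    """Wrap a plain JSON value the way Avro's JSON encoding expects."""
    if isinstance(avro_type, list):  # a union: pick the first matching branch
        for branch in avro_type:
            if matches(value, branch):
                # null stays bare; any other branch is tagged with its type name
                return None if branch == "null" else {branch: value}
        raise ValueError(f"{value!r} matches no branch of {avro_type}")
    if not matches(value, avro_type):
        raise ValueError(f"{value!r} is not a valid {avro_type}")
    return value


def convert_record(record, schema):
    """Assumption: a field missing from the input is treated as null."""
    return {
        field["name"]: to_avro_json(record.get(field["name"]), field["type"])
        for field in schema["fields"]
    }


for line in ['{"intval":12}', '{"intval":-73,"strval":"hello, there!!"}']:
    converted = convert_record(json.loads(line), SCHEMA)
    print(json.dumps(converted, separators=(",", ":")))
```

Run on the two input lines above, this prints the two internally encoded records shown earlier. A required field that is missing (for example intval) raises ValueError rather than being defaulted.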


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-672:
------------------------------

    Attachment: AVRO-672.patch

It might be confusing to provide two different JSON encodings for Avro data.  Also, the encoding in your patch is indeed simpler, but can lose information.  For example, a string that looks like base64-encoded binary data would be assumed by Jackson to be binary data, which might not always be the case.  Schemas that include fixed or enum values are not supported by this encoding, nor are many unions.
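
One way to see why many unions are a problem for an untagged encoding: in Avro's JSON encoding, bytes, fixed and enum values are all written as JSON strings, just like string. The following Python sketch groups the branches of a union by the JSON shape they would take and reports any shape claimed by more than one branch. The function names json_shape and ambiguous_shapes are hypothetical and this is only an illustration, not part of either patch.

```python
# JSON shape used by Avro's JSON encoding for each kind of type.
_SHAPES = {
    "null": "null", "boolean": "boolean",
    "int": "number", "long": "number", "float": "number", "double": "number",
    "string": "string", "bytes": "string", "fixed": "string", "enum": "string",
    "array": "array", "map": "object", "record": "object",
}


def type_kind(avro_type):
    """Return the kind of a schema given as a name or as a {"type": ...} dict."""
    return avro_type["type"] if isinstance(avro_type, dict) else avro_type


def json_shape(avro_type):
    """Name the JSON shape a value of this type takes once encoded."""
    return _SHAPES[type_kind(avro_type)]


def ambiguous_shapes(union):
    """Return {shape: [kinds]} for every shape shared by two or more branches."""
    by_shape = {}
    for branch in union:
        by_shape.setdefault(json_shape(branch), []).append(type_kind(branch))
    return {shape: kinds for shape, kinds in by_shape.items() if len(kinds) > 1}


# The union from the issue is unambiguous, so untagged values can be read back.
print(ambiguous_shapes(["string", "null"]))
# With bytes and string together, a bare JSON string could be either branch.
print(ambiguous_shapes(["bytes", "string"]))
```

An empty result means each untagged value can be assigned to exactly one branch by its JSON shape alone; a non-empty result means the reader would have to guess.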

If reading and writing arbitrary JSON is a priority, then the approach taken in AVRO-251 might be of interest.  Here's a patch that provides a DatumReader and DatumWriter for Jackson's JsonNode.  This uses a schema that permits arbitrary JSON data.  Would this be useful to you?  If so, we could provide it as a tool.



[jira] Commented: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Ron Bodkin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918322#action_12918322 ] 

Ron Bodkin commented on AVRO-672:
---------------------------------

The use case I'm most interested in supporting is converting from JSON data to a previously-defined Avro schema, either in a batch file conversion, or in memory (for use with map-reduce). 

This newer patch emits its output in a standard but different schema, and converting to a previously-defined (custom) schema seems to be a problem that would require code like what I wrote in my patch. Also, it'd be nice to be able to read a value like "1" into a double or a long field, even though it'd be parsed as a JSON integer node.
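
The numeric case could be handled with a small coercion step driven by the target field's type. This is a minimal Python sketch under stated assumptions, not code from either patch: coerce_number is a hypothetical name, integers are widened to float for float and double fields, and fractional values are rejected for int and long fields rather than truncated.

```python
import json

# Inclusive value ranges for Avro's two integer types.
INT_RANGES = {"int": (-2**31, 2**31 - 1), "long": (-2**63, 2**63 - 1)}


def coerce_number(value, avro_type):
    """Fit a parsed JSON number to a numeric Avro field type."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"{value!r} is not a JSON number")
    if avro_type in ("float", "double"):
        return float(value)  # a JSON integer such as 1 becomes 1.0
    if avro_type in INT_RANGES:
        if isinstance(value, float):
            raise TypeError(f"refusing to truncate {value!r} to {avro_type}")
        low, high = INT_RANGES[avro_type]
        if not low <= value <= high:
            raise ValueError(f"{value!r} is out of range for {avro_type}")
        return value
    raise TypeError(f"{avro_type!r} is not a numeric Avro type")


parsed = json.loads("1")  # parsed as an integer
print(coerce_number(parsed, "double"), coerce_number(parsed, "long"))
```

Whether out-of-range or fractional input should be an error or a silent conversion is a policy choice; the sketch picks the strict option.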

Also I have found it valuable to have transformation of names that have invalid characters since there's lots of valid JSON with identifiers that don't conform to the Avro identifier grammar. That would be pretty easy to put in this patch (although the regexp I used before was way too slow so I have a newer version that's efficient).
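
For reference, Avro names must start with a letter or underscore and continue with letters, digits or underscores. A name transformation of this kind can be sketched in Python without a regexp, as below. This is an illustration only, not the version from the patch: sanitize_name is a hypothetical name, and it assumes an invalid first character (such as a leading digit) is replaced rather than prefixed, and that an empty name becomes a single underscore.

```python
import string

# Characters allowed in the first position and in later positions of an Avro name.
_START = frozenset(string.ascii_letters + "_")
_REST = frozenset(string.ascii_letters + string.digits + "_")


def sanitize_name(name):
    """Replace every character that is invalid at its position with '_'."""
    if not name:
        return "_"  # assumption: an empty key maps to a single underscore
    return "".join(
        ch if ch in (_START if i == 0 else _REST) else "_"
        for i, ch in enumerate(name)
    )


print(sanitize_name("user-id"), sanitize_name("9lives"), sanitize_name("ok_Name1"))
```

Note that distinct keys can collide after this mapping (for example "a-b" and "a_b"), so a real tool would need a rule for that case.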

To allow reading in JSON text and creating objects in memory that conform to that schema, I think it'd be necessary to have hints for the type of data that arrays contain (e.g., in generated code or in runtime annotations if using a reflective style). That is something that I already ran into in trying to get the reflection reader to work with specific data (on AVRO-669).




[jira] Updated: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Ron Bodkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Bodkin updated AVRO-672:
----------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Ron Bodkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Bodkin updated AVRO-672:
----------------------------

    Attachment: AVRO-672.patch

Patch with implementation of this feature, including a test.




[jira] Commented: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918348#action_12918348 ] 

Doug Cutting commented on AVRO-672:
-----------------------------------

> I like the idea of having tools that manipulate "traditional" data formats into avro records, including guessing at the schema.

Do you think Ron's patch here is a good example of this that we should commit?

I worry that such tools might do 90% of what each application wants and require constant tweaking.  And each tweak might break other users.  So a tool has to either have lots of flexibility or be lossless.  But perhaps I'm just paranoid...



[jira] Commented: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918341#action_12918341 ] 

Philip Zeyliger commented on AVRO-672:
--------------------------------------

I like the idea of having tools that manipulate "traditional" data formats into avro records, including guessing at the schema.  CSV and TSV and one-json-per-line are obvious candidates here.



[jira] Commented: (AVRO-672) Convert JSON Text Input to Avro Tool

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918338#action_12918338 ] 

Doug Cutting commented on AVRO-672:
-----------------------------------

I am not sure whether the tool you need is a general-purpose tool that others will use, or whether it might be better to keep this in your application.  Avro's existing JSON encoding is primarily a tool for debugging.  Tools that can losslessly import and export JSON data into and out of Avro might also be generally useful.  A tool that adapts JSON data to pre-existing schemas could be generally useful if it permitted enough control of how the adaptation is done, but might also be rather application-specific.  What do you think?

