Posted to dev@avro.apache.org by "Matt Massie (JIRA)" <ji...@apache.org> on 2009/12/30 00:06:29 UTC

[jira] Created: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Replace lemon-generated JSON parser with simpler recursive descent parser
-------------------------------------------------------------------------

                 Key: AVRO-268
                 URL: https://issues.apache.org/jira/browse/AVRO-268
             Project: Avro
          Issue Type: Improvement
          Components: c
            Reporter: Matt Massie
             Fix For: 1.3.0


This is a drop-in replacement for the current JSON parser which is based on lemon (a LALR parser generator).

This parser 
* reads and returns a single JSON_value and its nested children (using recursive descent parsing)
* allows you to process JSON from streams in addition to static memory buffers
* correctly processes Unicode \u escaping, including surrogate pairs
* distinguishes between integer and real number representations 
* provides information about the line and character in JSON that failed to parse
* is much simpler to understand and maintain (fewer lines of code and fewer source files)
* is written to allow error recovery to be added later

This patch also adds more unit tests.
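
Several of the properties listed above can be illustrated with Python's standard json module, which behaves the same way on these points. This is only an illustration in another language, not the C code from the patch, and the sample inputs are made up for the example.

```python
import json

# A pair of \u escapes encoding one code point outside the BMP (U+1F600)
# is combined into a single character rather than left as two surrogates.
text = json.loads('"\\ud83d\\ude00"')
assert text == "\U0001F600" and len(text) == 1

# Integer and real number representations are kept distinct.
assert isinstance(json.loads("42"), int)
assert isinstance(json.loads("42.0"), float)

# A parse failure reports the line and column where it happened.
try:
    json.loads('{\n  "key": oops\n}')
except json.JSONDecodeError as err:
    error_position = (err.lineno, err.colno)
```

After the failed parse, error_position holds the line and column of the offending token, which is the kind of information the description says the new parser provides.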





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Massie updated AVRO-268:
-----------------------------

    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)

No need for this work.  I've finally found a high-quality C JSON parser with a friendly license, Jansson, which will serve as the JSON parser moving forward.  Sorry for the JIRA noise.



[jira] Assigned: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Massie reassigned AVRO-268:
--------------------------------

    Assignee: Matt Massie



[jira] Updated: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Massie updated AVRO-268:
-----------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795306#action_12795306 ] 

Jeff Hammerbacher commented on AVRO-268:
----------------------------------------

Worth noting that the new behavior is not standard:

JavaScript:
{code}
js> var a = JSON.parse('{ "key": "value" } foo bar baz');
js: "/Users/hammer/codebox/narwhal/engines/default/lib/json.js", line 474: exception from uncaught JavaScript throw: SyntaxError: JSON.parse
{code}

Python:
{code}
>>> b = json.loads('{ "key": "value" } foo bar baz')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.5-i386/egg/simplejson/__init__.py", line 307, in loads
  File "build/bdist.macosx-10.5-i386/egg/simplejson/decoder.py", line 338, in decode
ValueError: Extra data: line 1 column 19 - line 1 column 30 (char 19 - 30)
{code}



[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795372#action_12795372 ] 

Matt Massie commented on AVRO-268:
----------------------------------

Ryan-

We posted comments about two minutes apart.

I hope that my earlier comment clarifies that I didn't implement a non-standard JSON parser.

-Matt



[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795359#action_12795359 ] 

Ryan King commented on AVRO-268:
--------------------------------

Please don't use a non-standard JSON parser.

My understanding was that part of the reason for choosing JSON is that it is a standard format (http://www.ietf.org/rfc/rfc4627.txt) with parsers already available in many languages. If you use a non-standard JSON parser, you lose that benefit.



[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795194#action_12795194 ] 

Doug Cutting commented on AVRO-268:
-----------------------------------

Sounds like a fine approach.




[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Ryan King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795453#action_12795453 ] 

Ryan King commented on AVRO-268:
--------------------------------

Alright, I now understand a bit better what you're going for.

Now forgive me if this is a naive question: whenever we need to read JSON for Avro, do we know ahead of time how long the JSON blob will be?

Most JSON parsers don't have the property of returning as soon as a full object has been parsed, so we'll need to be able to read just the appropriate length, then parse as JSON.
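
One way to "read just the appropriate length, then parse" is to frame each JSON blob with a length prefix. The sketch below assumes a hypothetical framing of a 4-byte big-endian length followed by UTF-8 JSON. That framing and the helper names write_framed and read_framed are assumptions made for illustration; they are not taken from the Avro specification or from the patch.

```python
import io
import json
import struct


def write_framed(stream, value):
    """Write value as JSON preceded by a 4-byte big-endian length (hypothetical framing)."""
    payload = json.dumps(value).encode("utf-8")
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_framed(stream):
    """Read exactly one length-prefixed JSON blob and hand it to a strict parser."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended before a length prefix was read")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("stream ended in the middle of a JSON blob")
    return json.loads(payload.decode("utf-8"))


buffer = io.BytesIO()
write_framed(buffer, {"key": "value"})
buffer.write(b"foo bar baz")  # unrelated bytes that follow the JSON blob
buffer.seek(0)

first = read_framed(buffer)  # only the framed bytes reach json.loads
rest = buffer.read()         # whatever follows is left untouched in the stream
```

With this kind of framing a strict, greedy parser is enough, because it never sees the bytes that follow the blob.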



[jira] Updated: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Massie updated AVRO-268:
-----------------------------

    Attachment: AVRO-268.patch

I noticed that two old directories still exist in svn. The following cleanup will also be performed when this patch is committed:

$ rm -rf src/c/json/fail
$ rm -rf src/c/json/pass





[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795193#action_12795193 ] 

Matt Massie commented on AVRO-268:
----------------------------------

I should also mention that a few unit tests are *removed* with this patch as well.  Inputs with trailing characters that used to fail to parse now succeed with the new parser.

For example, the following JSON used to fail to parse

{code}
{ "key": "value" } foo bar baz
{code}

The new parser instead returns as soon as it reaches the closing '}', ignoring the trailing junk characters.
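
To show why a recursive descent parser naturally stops at the closing '}', here is a minimal sketch in Python. It is not the patch's C code: it handles only a simplified subset of JSON (no string escapes, integers only), and the function names are made up for the example. Each function returns the parsed value together with the index just past it, so the caller can see where parsing stopped.

```python
def skip_ws(text, i):
    """Advance past whitespace and return the new index."""
    while i < len(text) and text[i] in " \t\r\n":
        i += 1
    return i


def parse_value(text, i=0):
    """Parse one JSON value starting at index i; return (value, next_index)."""
    i = skip_ws(text, i)
    if i >= len(text):
        raise ValueError("unexpected end of input")
    ch = text[i]
    if ch == "{":
        return parse_object(text, i + 1)
    if ch == "[":
        return parse_array(text, i + 1)
    if ch == '"':
        end = text.index('"', i + 1)  # raises ValueError if unterminated
        return text[i + 1:end], end + 1
    for literal, value in (("true", True), ("false", False), ("null", None)):
        if text.startswith(literal, i):
            return value, i + len(literal)
    j = i + 1 if ch == "-" else i
    while j < len(text) and text[j] in "0123456789":
        j += 1
    if j == i or text[i:j] == "-":
        raise ValueError(f"unexpected character at offset {i}")
    return int(text[i:j]), j


def parse_object(text, i):
    """Parse members after '{' up to and including the matching '}'."""
    result = {}
    i = skip_ws(text, i)
    if i < len(text) and text[i] == "}":
        return result, i + 1
    while True:
        key, i = parse_value(text, i)
        if not isinstance(key, str):
            raise ValueError("object keys must be strings")
        i = skip_ws(text, i)
        if i >= len(text) or text[i] != ":":
            raise ValueError(f"expected ':' at offset {i}")
        result[key], i = parse_value(text, i + 1)
        i = skip_ws(text, i)
        if i < len(text) and text[i] == ",":
            i += 1
            continue
        if i < len(text) and text[i] == "}":
            return result, i + 1  # done: nothing after '}' is examined
        raise ValueError(f"expected ',' or '}}' at offset {i}")


def parse_array(text, i):
    """Parse elements after '[' up to and including the matching ']'."""
    items = []
    i = skip_ws(text, i)
    if i < len(text) and text[i] == "]":
        return items, i + 1
    while True:
        item, i = parse_value(text, i)
        items.append(item)
        i = skip_ws(text, i)
        if i < len(text) and text[i] == ",":
            i += 1
            continue
        if i < len(text) and text[i] == "]":
            return items, i + 1
        raise ValueError(f"expected ',' or ']' at offset {i}")


doc = '{ "key": "value" } foo bar baz'
value, end = parse_value(doc)
remainder = doc[end:]  # the trailing characters were never looked at
```

Because parse_object returns the moment it consumes the matching '}', the trailing text is simply left for the caller. A strict wrapper could check that only whitespace remains after end.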







[jira] Commented: (AVRO-268) Replace lemon-generated JSON parser with simpler recursive descent parser

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795357#action_12795357 ] 

Matt Massie commented on AVRO-268:
----------------------------------

I should have been clearer.  

I'm not saying that

{code}
{ "key" : "value" } foo bar baz
{code}

is valid JSON.  It's not.

I only meant that the parser isn't greedy and returns as soon as it has completed a JSON value.  However, if the stream in this example were left positioned at 'foo', the parser would report an error the next time it is called.
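
The same non-greedy behavior can be seen with Python's standard json module, whose JSONDecoder.raw_decode returns one value plus the index where it stopped. This is offered only as an analogy in another language, not as a description of the C API in the patch.

```python
import json

decoder = json.JSONDecoder()
doc = '{ "key" : "value" } foo bar baz'

# First call: returns the completed value and the index just past it.
value, end = decoder.raw_decode(doc)

# Second call, starting at 'foo': that is not a JSON value, so it raises.
rest = doc[end:].lstrip()
try:
    decoder.raw_decode(rest)
    second_call_failed = False
except json.JSONDecodeError:
    second_call_failed = True
```

The first call succeeds even though the document as a whole is not valid JSON; the error only surfaces when the caller asks for another value.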
