Posted to dev@avro.apache.org by "Miki Tebeka (JIRA)" <ji...@apache.org> on 2011/07/14 20:37:59 UTC

[jira] [Created] (AVRO-860) Invalid JSON when printing out records with unicode

Invalid JSON when printing out records with unicode
---------------------------------------------------

                 Key: AVRO-860
                 URL: https://issues.apache.org/jira/browse/AVRO-860
             Project: Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.5.1
            Reporter: Miki Tebeka


I have an Avro file that, when printed, yields invalid JSON.
The code for iterating and printing is:
{code}

            DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
            DataFileReader<GenericRecord> dataFileReader =
                new DataFileReader<GenericRecord>(data, reader);

            while (dataFileReader.hasNext()) {
                System.out.println(dataFileReader.next().toString());
            }
{code}
and the relevant JSON snippet is
{code}
    "description": "Move™ offers advertisers the opportunity to deliver messages to consumers at a time when consumers are making the biggest purchases of their lives\uMOVE™ OFFERS ADVERTISERS THE OPPORTUNITY TO DELIVER MESSAGES TO CONSUMERS AT A TIME WHEN CONSUMERS ARE MAKING THE BIGGEST PURCHASES OF THEIR LIVES—OR REMODELING, REDECORATING AND MAINTAINING THEIR MOST IMPORTANT ASSETS.or remodeling, redecorating and maintaining their most important assets.",
{code}
(The \uMOVE is the problematic part).

However if I do:
{code}
                GenericRecord record = dataFileReader.next();
                Utf8 desc = (Utf8)record.get("description");
                System.out.println(desc);
{code}

Then I get
{code}
Move™ offers advertisers the opportunity to deliver messages to consumers at a time when consumers are making the biggest purchases of their lives—or remodeling, redecorating and maintaining their most important assets.
{code}
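
For reference, the \uMOVE sequence is invalid because JSON (RFC 4627) requires exactly four hexadecimal digits after \u. The following is a minimal sketch of a check for that rule, using only the Java standard library. The class and method names are hypothetical and not part of Avro or Jackson, and the check looks only at backslash escapes; it does not validate anything else about a JSON document.

```java
// Sketch: checks that every backslash escape inside a JSON string body is well formed.
// JsonEscapeCheck and hasValidEscapes are hypothetical names, not Avro or Jackson API.
final class JsonEscapeCheck {

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Returns true if each backslash in the given string body starts a valid JSON escape. */
    static boolean hasValidEscapes(String body) {
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) != '\\') {
                continue;
            }
            i++;
            if (i >= body.length()) {
                return false; // dangling backslash at the end of the input
            }
            char kind = body.charAt(i);
            if (kind == 'u') {
                // A unicode escape needs exactly four hex digits after the 'u'.
                if (i + 4 >= body.length()) {
                    return false;
                }
                for (int j = 1; j <= 4; j++) {
                    if (!isHexDigit(body.charAt(i + j))) {
                        return false;
                    }
                }
                i += 4;
            } else if ("\"\\/bfnrt".indexOf(kind) < 0) {
                return false; // not one of the single-character escapes JSON allows
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasValidEscapes("lives\\u2014or remodeling")); // prints true
        System.out.println(hasValidEscapes("lives\\uMOVE"));              // prints false
    }
}
```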

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065467#comment-13065467 ] 

Doug Cutting commented on AVRO-860:
-----------------------------------

The problem looks to be in GenericData#writeEscapedString(), added in AVRO-713.
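
The reported output is consistent with one particular kind of slip in a hand-written escaper: after emitting \u, the whole input string is upper-cased and appended where the upper-cased hex digits were intended. The sketch below is a hypothetical reconstruction and not the actual GenericData#writeEscapedString() code. The class name, the method names and the set of escaped ranges are assumptions, chosen so that U+2014 (the dash in the report) is escaped and U+2122 (the trademark sign) is not, as in the reported output.

```java
import java.util.Locale;

// Hypothetical reconstruction for illustration only; this is not the Avro source.
final class EscapeBugSketch {

    // Assumed rule: control characters and the range U+2000..U+20FF get escaped.
    static boolean needsEscape(char ch) {
        return ch <= '\u001F' || (ch >= '\u2000' && ch <= '\u20FF');
    }

    // Renders a JSON string literal; buggy=true reproduces the suspected slip.
    static String escape(String string, boolean buggy) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < string.length(); i++) {
            char ch = string.charAt(i);
            if (ch == '"' || ch == '\\') {
                out.append('\\').append(ch);
            } else if (needsEscape(ch)) {
                String hex = Integer.toHexString(ch);
                out.append("\\u");
                for (int pad = hex.length(); pad < 4; pad++) {
                    out.append('0'); // left-pad the code unit to four digits
                }
                // Suspected slip: the whole input is upper-cased instead of the hex digits.
                out.append(buggy
                        ? string.toUpperCase(Locale.ROOT)
                        : hex.toUpperCase(Locale.ROOT));
            } else {
                out.append(ch);
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        String s = "lives\u2014or";
        System.out.println(escape(s, true));  // backslash-u followed by the upper-cased input
        System.out.println(escape(s, false)); // backslash-u followed by 2014
    }
}
```

With buggy set to true, each escaped character is followed by an upper-cased copy of the entire input, which matches the shape of the description field in the report.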


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Miki Tebeka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067172#comment-13067172 ] 

Miki Tebeka commented on AVRO-860:
----------------------------------

I'm not an expert on the subject, but IMO JSON is UTF-8.


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Miki Tebeka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065616#comment-13065616 ] 

Miki Tebeka commented on AVRO-860:
----------------------------------

Is there any reason the JSON is constructed manually rather than with Jackson (which is already a requirement)?


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066231#comment-13066231 ] 

Scott Carey commented on AVRO-860:
----------------------------------

It could be as simple as creating a very simple GenericData.Record with a string field containing "™" (you can place the UTF-8 character directly in the source, or use a \u literal).
{code}
// "expected" stands for the JSON string the record should render to.
Schema s = Schema.parse("{\"type\":\"record\", \"name\":\"foo\", \"fields\": [{\"name\":\"bar\", \"type\":\"string\"}]}");
GenericData.Record foo = new GenericData.Record(s);
foo.put(0, "utf8 trademark char-> ™ <-");
Assert.assertEquals(expected, foo.toString());
{code}


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067283#comment-13067283 ] 

Scott Carey commented on AVRO-860:
----------------------------------

http://www.json.org/ indicates that all Unicode characters other than control characters, " and \ are valid inside a string. It does not specify which encoding is valid, just that it is Unicode, so I assume it must be consistent with whatever encoding the entire document is in.

http://www.ietf.org/rfc/rfc4627.txt is more precise, and Jackson seems to implement it. In that case, only the control characters between U+0000 and U+001F inclusive must be escaped, along with \ and ".

The old code encoded more code points, which will print out more cleanly in some cases but has nothing to do with JSON compliance.

I think we can safely delegate this to Jackson and trust it outputs valid JSON string encodings.
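
As an illustration of the RFC 4627 rule described above, here is a minimal sketch of a string quoter that escapes only the quotation mark, the backslash and the control characters U+0000 through U+001F, and leaves everything else as literal characters. It uses only the Java standard library; the class and method names are hypothetical, and this is neither Jackson's implementation nor the code in the attached patch.

```java
// Sketch of the minimal escaping RFC 4627 requires for a JSON string.
// MinimalJsonString and quote are hypothetical names, not Avro or Jackson API.
final class MinimalJsonString {

    static String quote(String value) {
        StringBuilder out = new StringBuilder(value.length() + 2).append('"');
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (ch == '"' || ch == '\\') {
                out.append('\\').append(ch);           // the two mandatory character escapes
            } else if (ch <= 0x1F) {
                out.append(String.format("\\u%04x", (int) ch)); // control characters
            } else {
                out.append(ch);                        // all other characters stay literal
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // The trademark sign and the en dash are emitted unescaped.
        System.out.println(quote("unicode char-> \u2122 \u2013 <-"));
        // The newline is emitted as a six-character unicode escape.
        System.out.println(quote("line one\nline two"));
    }
}
```

Output produced this way is only valid if the surrounding document is written in a Unicode encoding such as UTF-8, which is the consistency concern raised above.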




[jira] [Updated] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-860:
------------------------------

    Fix Version/s: 1.6.0
         Assignee: Miki Tebeka


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066296#comment-13066296 ] 

Scott Carey commented on AVRO-860:
----------------------------------

Sorry, I should have noticed this earlier: this was fixed in AVRO-851.

The test below fails if I revert AVRO-851 on trunk, so AVRO-851 likely fixes the issue you see too. I am not sure whether AVRO-851 made it into the 1.5.2 release candidate.

AVRO-851 did not switch to Jackson; I think that is still a valuable improvement.

However, it appears that the patch here alters the output: it does not escape the character '\u2013', leaving it as a literal UTF-8 character ('–'). Is it required to escape Unicode characters in this range? Jackson apparently does not in its default configuration.

{code}
  @Test
  public void testUtf8StringPrint() {
    Schema s = Schema.parse("{\"type\":\"record\", \"name\":\"foo\", \"fields\": [{\"name\":\"bar\", \"type\":[\"null\",\"string\"]}]}");
    GenericRecord foo = new GenericData.Record(s);
    foo.put(0, new Utf8("unicode char-> \u2013 <-"));
    assertEquals("{\"bar\": \"unicode char-> \\u2013 <-\"}", foo.toString());
  }
{code}



[jira] [Updated] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Miki Tebeka (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miki Tebeka updated AVRO-860:
-----------------------------

    Attachment: AVRO-860.diff

Adding the test Scott suggested.


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066244#comment-13066244 ] 

Scott Carey commented on AVRO-860:
----------------------------------

I tried adding the below to TestGenericData.java:

{code}
  @Test
  public void testUtf8String() {
    Schema s = Schema.parse("{\"type\":\"record\", \"name\":\"foo\", \"fields\": [{\"name\":\"bar\", \"type\":[\"null\",\"string\"]}]}");
    GenericRecord foo = new GenericData.Record(s);
    foo.put(0, new Utf8("utf8 trademark char-> ™ <-"));
    System.out.println(foo);
  }
{code}

But it did not print out anything suspicious.  I have not tried it using your data file.  This is with trunk -- are you using trunk or 1.5.1?


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066158#comment-13066158 ] 

Scott Carey commented on AVRO-860:
----------------------------------

Looks good. Is there a unit test that fails before the patch and passes after it? If not, we should add one to this patch.


[jira] [Updated] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Miki Tebeka (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miki Tebeka updated AVRO-860:
-----------------------------

    Attachment: AVRO-860.diff

Patch to have toString() use Jackson.


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065618#comment-13065618 ] 

Doug Cutting commented on AVRO-860:
-----------------------------------

No good reason.  History.  Schema.toString() uses Jackson.

> Invalid JSON when printing out records with unicode
> ---------------------------------------------------
>
>                 Key: AVRO-860
>                 URL: https://issues.apache.org/jira/browse/AVRO-860
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Miki Tebeka
>              Labels: java, json, unicode
>


       

[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070718#comment-13070718 ] 

Doug Cutting commented on AVRO-860:
-----------------------------------

So we have two different patches for this, one here and one in AVRO-851.  This one has the advantage that it uses Jackson, and is thus more likely to produce valid JSON.  However, it makes a deep copy of the data structures, which probably hurts performance, and performance here is probably important.

We could develop an implementation that, instead of Jackson's ObjectMapper, uses Jackson's lower-level JsonGenerator API, as is done in Schema.java.  That might both perform well and delegate JSON details to Jackson.  On the other hand, JSON is simple enough that the approach in AVRO-851 might be less code and work well enough.

Thoughts?
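
For illustration, the hand-written approach mentioned above could look roughly like the following. This is only a sketch: it is not the AVRO-851 patch and not Avro's actual code, and the names JsonStringEscaper and quote are hypothetical. It assumes that escaping every non-ASCII character as a four-digit backslash-u sequence is acceptable, and it appends straight into a StringBuilder rather than copying the data first.

```java
final class JsonStringEscaper {
    private JsonStringEscaper() {}

    /** Returns s as a quoted JSON string literal that contains only ASCII characters. */
    static String quote(CharSequence s) {
        StringBuilder out = new StringBuilder(s.length() + 2);
        out.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b");  break;
                case '\f': out.append("\\f");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20 || c > 0x7E) {
                        // Control and non-ASCII characters: always emit exactly
                        // four hex digits after the backslash-u prefix.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        out.append('"');
        return out.toString();
    }

    public static void main(String[] args) {
        // The trademark sign is U+2122.
        System.out.println(quote("Move\u2122 offers"));
    }
}
```

Running main prints "Move\u2122 offers" (including the double quotes). Characters outside the Basic Multilingual Plane are written as two escaped surrogates, which JSON permits.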

> Invalid JSON when printing out records with unicode
> ---------------------------------------------------
>
>                 Key: AVRO-860
>                 URL: https://issues.apache.org/jira/browse/AVRO-860
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Miki Tebeka
>              Labels: java, json, unicode
>             Fix For: 1.6.0
>
>         Attachments: AVRO-860.diff, AVRO-860.diff, m.avro
>
>


       

[jira] [Resolved] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved AVRO-860.
-------------------------------

    Resolution: Duplicate

Closing this as a duplicate of AVRO-851.
                
> Invalid JSON when printing out records with unicode
> ---------------------------------------------------
>
>                 Key: AVRO-860
>                 URL: https://issues.apache.org/jira/browse/AVRO-860
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Miki Tebeka
>            Assignee: Miki Tebeka
>              Labels: java, json, unicode
>         Attachments: AVRO-860.diff, AVRO-860.diff, m.avro
>
>


[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066216#comment-13066216 ] 

Doug Cutting commented on AVRO-860:
-----------------------------------

I think the "™" above is what triggered the bug, and that any string containing it would be mis-encoded.
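
One way to check printed output for this class of error is to scan for a backslash-u that is not followed by four hex digits, which is what makes \uMOVE invalid JSON. The following is a minimal sketch; UnicodeEscapeCheck and firstMalformedUnicodeEscape are hypothetical names, and it checks only backslash-u escapes, not full JSON validity.

```java
final class UnicodeEscapeCheck {
    private UnicodeEscapeCheck() {}

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /**
     * Returns the index of the first backslash-u escape that is not followed
     * by four hex digits, or -1 if every such escape is well formed.
     */
    static int firstMalformedUnicodeEscape(String text) {
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) != '\\') {
                i++;
                continue;
            }
            if (i + 1 < text.length() && text.charAt(i + 1) == 'u') {
                if (i + 6 > text.length()) {
                    return i; // escape is cut off before four digits
                }
                for (int j = i + 2; j < i + 6; j++) {
                    if (!isHex(text.charAt(j))) {
                        return i;
                    }
                }
                i += 6;
            } else {
                // Some other escape (including an escaped backslash): skip both characters.
                i += 2;
            }
        }
        return -1;
    }
}
```

Applied to the snippet from the report, the method would return the position of the backslash in "lives\uMOVE"; applied to a correctly escaped "Move\u2122" it returns -1.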

> Invalid JSON when printing out records with unicode
> ---------------------------------------------------
>
>                 Key: AVRO-860
>                 URL: https://issues.apache.org/jira/browse/AVRO-860
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Miki Tebeka
>              Labels: java, json, unicode
>         Attachments: AVRO-860.diff
>
>


       

[jira] [Updated] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Miki Tebeka (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miki Tebeka updated AVRO-860:
-----------------------------

    Attachment: m.avro

Avro file that is encoded to invalid JSON

> Invalid JSON when printing out records with unicode
> ---------------------------------------------------
>
>                 Key: AVRO-860
>                 URL: https://issues.apache.org/jira/browse/AVRO-860
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Miki Tebeka
>              Labels: java, json, unicode
>         Attachments: AVRO-860.diff, m.avro
>
>


       

[jira] [Commented] (AVRO-860) Invalid JSON when printing out records with unicode

Posted by "Miki Tebeka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066206#comment-13066206 ] 

Miki Tebeka commented on AVRO-860:
----------------------------------

OK, I'll work on that. Note that Java is not my strong suit (I'm a Python developer). I will try to dig out an offending Avro file.

> Invalid JSON when printing out records with unicode
> ---------------------------------------------------
>
>                 Key: AVRO-860
>                 URL: https://issues.apache.org/jira/browse/AVRO-860
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Miki Tebeka
>              Labels: java, json, unicode
>         Attachments: AVRO-860.diff
>
>
