You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by "Scott Carey (JIRA)" <ji...@apache.org> on 2010/09/16 07:47:32 UTC

[jira] Created: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Java: Streamline writing of Strings for Encoders and GenericDatumWriter
-----------------------------------------------------------------------

                 Key: AVRO-668
                 URL: https://issues.apache.org/jira/browse/AVRO-668
             Project: Avro
          Issue Type: Improvement
            Reporter: Scott Carey
            Assignee: Scott Carey
             Fix For: 1.4.1


We can streamline writing of strings to minimize object creation during writes.

We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912667#action_12912667 ] 

Scott Carey commented on AVRO-668:
----------------------------------

{quote}
Did you mean to use 'Utf8.class.equals(charSequence.getClass())'? I think 'charSequence instanceof Utf8' reads much better, and, as we discussed in AVRO-667, the performance improvement is perhaps not significant and this might be risky if Utf8 is not final. 
{quote}

At first I did this defensively.  It is not risky for a subclass, as that would still work, via the CharSequence contract on toString().

But thinking about it a bit more, this only requires that a sublcass implement getByteLength() and getBytes() properly, and these are unlikely to be done improperly.  
Unlike in other places where equals() and hashCode() have to remain consistent with Utf8 and it is easy for someone to write a subclass that is self-consistent with Object.equals() and Object.hashCode() but break Utf8's variations.  

Therefore instanceof has little value from the defensive side here, and not much performance-wise either.  Writing is far more expensive than the difference between instanceof and class.equals() here.

{quote}
Did we also want to update GenericDatumWriter#writeString()? Maybe not
{quote}

Oops, I did intend to include that!

Updated patch on the way.

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated AVRO-668:
-----------------------------

    Attachment: AVRO-668.patch

Tests in o.a.a.io pass.

Doug's suggestions incorporated.  



> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910246#action_12910246 ] 

Doug Cutting commented on AVRO-668:
-----------------------------------

> I can't run all java tests from ant on trunk.

That's a problem.  Do you have a '127.0.0.1 localhost' line in /etc/hosts?  How about a '0.0.0.0 localhost' line?  I have both, but don't know if either are required.

> I think providing both writeString(Utf8) and writeString(CharSequence) at the Encoder level would be useful. However thats an API change that would have to wait until 1.5.

Why?  I'm don't see how adding that method with a default implementation would be an incompatible change.


> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved AVRO-668.
-------------------------------

    Hadoop Flags: [Reviewed]
      Resolution: Fixed

I just committed this.  Thanks, Scott!

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch, AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated AVRO-668:
-----------------------------

    Attachment: AVRO-668.patch

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch, AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910239#action_12910239 ] 

Scott Carey commented on AVRO-668:
----------------------------------

As far as performance goes, I have not set up a benchmark.  I did note the String > Utf8 > byte[] > outputStream garbage creation in a profiler of my app.  But reducing garbage almost never does much in a microbenchmark, it is more of a system level optimization when you've got lots of code churning the GC.  It might be useful for:
http://wiki.github.com/eishay/jvm-serializers/   though.  Anything that makes String > out and in > String more streamlined will help that benchmark.

* The most recent patch compiles. I failed to include it the first time.
* I'll change it to UTF8_CS

I can't run all java tests from ant on trunk. Not sure if its my fault or not but it hangs forever in:
[junit] Running org.apache.avro.TestProtocolHttp
[junit] 5455 [main] INFO org.apache.avro.ipc.DatagramTransceiver - sent to /127.0.0.1:15276

The test prior fails too:
    [junit] Running org.apache.avro.TestProtocolGenericMeta
    [junit] 5427 [SocketServer on 0.0.0.0/0.0.0.0:0] INFO org.apache.avro.ipc.SocketTransceiver - open to /10.0.0.231:65043
    [junit] 5429 [SocketServer on 0.0.0.0/0.0.0.0:0] INFO org.apache.avro.ipc.SocketServer - stopping /0.0.0.0
    [junit] 5427 [main] INFO org.apache.avro.ipc.SocketTransceiver - open to 0.0.0.0/0.0.0.0:65040
    [junit] 5441 [Connection to /10.0.0.231:65043] INFO org.apache.avro.TestProtocolGeneric - hello: bob
    [junit] 5441 [main] INFO org.apache.avro.ipc.SocketTransceiver - closing to 0.0.0.0/0.0.0.0:65040
    [junit] Tests run: 6, Failures: 0, Errors: 5, Time elapsed: 0.071 sec

with 5 errors similar to:
Testcase: testEcho took 0 sec
        Caused an ERROR
null
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.avro.ipc.SocketTransceiver.writeLength(SocketTransceiver.java:94)
        at org.apache.avro.ipc.SocketTransceiver.writeBuffers(SocketTransceiver.java:84)
        at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:38)
        at org.apache.avro.ipc.Requestor.request(Requestor.java:123)
        at org.apache.avro.generic.GenericRequestor.request(GenericRequestor.java:43)
        at org.apache.avro.TestProtocolGeneric.testEcho(TestProtocolGeneric.java:132)

Either that is my setup, or a different JIRA to make that work.  Or its just more stupid Mac crap causing trouble.

If I run TestBlockIO2 by itself, I can see the error you mention but I'm puzzled by what that might be at the moment.  Maybe its because I switched from the deprecated Utf8 getLength() to getByteLength()?


I think providing both writeString(Utf8) and writeString(CharSequence) at the Encoder level would be useful.  However thats an API change that would have to wait until 1.5.   That is not so bad I guess this isn't a huge performance benefit, just something that combined with 4 or 5 other little things would add up.



> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910755#action_12910755 ] 

Doug Cutting commented on AVRO-668:
-----------------------------------

Did you mean to use 'Utf8.class.equals(charSequence.getClass())'?  I think 'charSequence instanceof Utf8' reads much better, and, as we discussed in AVRO-667, the performance improvement is perhaps not significant and this might be risky if Utf8 is not final.

(I wonder whether HotSpot doesn't already optimize instanceof for classes with no subclasses yet defined to be ==.  It could, couldn't it?)

Did we also want to update GenericDatumWriter#writeString()?  Maybe not...

Other than that, +1


> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910229#action_12910229 ] 

Doug Cutting commented on AVRO-668:
-----------------------------------

Scott, have you benchmarked this?  If so, how much does it improve things?

Some problems with the current patch:
 - it doesn't compile, failing with: ValidatingEncoder is not abstract and does not override abstract method writeString(java.lang.String)
 - once that's fixed, some tests fail for me (e.g., TestBlockIO2)
 - the utf8cs constant should be named UTF8_CS.  this fails checkstyle.

Should we also add Encoder#writeString(CharSequence s) as a convenience?  This could be implemented as:

{code}
if (s instanceof Utf8) 
  writeString((Utf8)s);
else
   writeString(s.toString());
{code}

That would move this logic from GenericDatumWriter to Encoder.  What do you think?  Are there other places that method might be useful?


> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated AVRO-668:
-----------------------------

    Attachment: AVRO-668.patch

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated AVRO-668:
-----------------------------

    Attachment: AVRO-668.patch

Patch includes suggestion by Doug to use default implementations in Encoder for writeString(String) and writeString(CharSequence) and move logic out of GenericDatumWriter.

TestBlockIO2 still fails, but feedback on this variant would be useful regardless (or ideas on why these changes would cause such a failure).


Doug:
My hosts file has:
{noformat}
127.0.0.1	localhost
255.255.255.255	broadcasthost
::1             localhost 
fe80::1%lo0	localhost
{noformat}

Which looks like a typical BSD/Mac hosts file.
I've got several interfaces in ifconfig; one of those might be causing confusion.

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated AVRO-668:
-----------------------------

    Attachment:     (was: AVRO-668.patch)

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910221#action_12910221 ] 

Scott Carey commented on AVRO-668:
----------------------------------

We could also change this signature to take CharSequence instead of String.  But that would break the user-facing API.  This only requires those who have extended Encoder to implement the string variation, and that could be backed out and simply overriden in the child classes.

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated AVRO-668:
-----------------------------

    Attachment: AVRO-668.patch

I forgot ValidatingEncoder in the last patch file. Oops.

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-668) Java: Streamline writing of Strings for Encoders and GenericDatumWriter

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910260#action_12910260 ] 

Scott Carey commented on AVRO-668:
----------------------------------

The last patch conflicts with AVRO-667,  I'll make a patch that works on top of AVRO-667 if we commit it.

> Java: Streamline writing of Strings for Encoders and GenericDatumWriter
> -----------------------------------------------------------------------
>
>                 Key: AVRO-668
>                 URL: https://issues.apache.org/jira/browse/AVRO-668
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>             Fix For: 1.4.1
>
>         Attachments: AVRO-668.patch, AVRO-668.patch
>
>
> We can streamline writing of strings to minimize object creation during writes.
> We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.