Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/01/06 01:47:44 UTC

[jira] Created: (PIG-599) BufferedPositionedInputStream isn't buffered

BufferedPositionedInputStream isn't buffered
--------------------------------------------

                 Key: PIG-599
                 URL: https://issues.apache.org/jira/browse/PIG-599
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
            Reporter: Alan Gates
            Assignee: Alan Gates
             Fix For: types_branch


org.apache.pig.impl.io.BufferedPositionedInputStream is not actually buffered.  This is because it sits atop an FSDataInputStream (somewhere down the stack), which is buffered.  So, to avoid double buffering, which can be bad, BufferedPositionedInputStream was written without buffering.  But the FSDataInputStream is far enough down the stack that it is still quite costly to call read() individually for each byte.  A run through a profiler shows that a fair amount of time is being spent in BufferedPositionedInputStream.read().
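
To illustrate the cost described above, here is a minimal sketch that counts how many read calls reach the bottom of a stream stack when the caller reads one byte at a time, with and without a java.io.BufferedInputStream in between. The names ReadCallCount, CountingStream and callsReachingSource are made up for this illustration and are not part of Pig or Hadoop; a ByteArrayInputStream stands in for the real FSDataInputStream stack, so this shows only the call-count effect, not actual timings.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadCallCount {
    /** Hypothetical helper: counts every read call that reaches the wrapped stream. */
    static class CountingStream extends FilterInputStream {
        long calls = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads 'size' bytes one at a time and returns how many calls reached the source. */
    static long callsReachingSource(int size, boolean buffered) throws IOException {
        CountingStream source = new CountingStream(new ByteArrayInputStream(new byte[size]));
        // With buffering, single-byte reads are served from an 8 KB in-memory buffer.
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        while (in.read() != -1) {
            // consume the stream byte by byte, as a record parser might
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        int size = 1 << 20;
        System.out.println("unbuffered calls: " + callsReachingSource(size, false));
        System.out.println("buffered calls:   " + callsReachingSource(size, true));
    }
}
```

Without the buffer, every byte costs one trip down the whole stack; with it, the stack is only traversed once per buffer refill.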

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661384#action_12661384 ] 

Benjamin Reed commented on PIG-599:
-----------------------------------

There is a problem with using a buffered stream and compression. We have to do some really subtle things to get the mapping of compression blocks into positions so that the load functions work out properly. If read-ahead happens underneath, things break. (It would be excellent if someone had a better way of doing it.) In the absence of a better idea, I think we should check the stream that we are buffering and skip the buffering if it is a compressed stream.
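
The check suggested above could look roughly like the following sketch. It is only an assumption about the shape of such a check, not the code from the attached patches: the class ConditionalBuffering and its methods are hypothetical, and java.util.zip.InflaterInputStream is used as a stand-in for whichever compressed stream class Pig actually hands to the load functions.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

public class ConditionalBuffering {
    /**
     * Hypothetical test for "is this a compressed stream?".
     * InflaterInputStream (the parent of GZIPInputStream) is a stand-in here.
     */
    static boolean isCompressed(InputStream in) {
        return in instanceof InflaterInputStream;
    }

    /**
     * Wraps plain streams in a BufferedInputStream, but returns compressed
     * streams unchanged so that no read-ahead happens underneath them.
     */
    static InputStream maybeBuffer(InputStream in) {
        return isCompressed(in) ? in : new BufferedInputStream(in);
    }
}
```

An instanceof test only sees the outermost stream, so a compressed stream hidden inside another wrapper would still be buffered; a real implementation would need to account for how the stream stack is actually built.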



[jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-599:
---------------------------

    Attachment: loadperf.patch

This patch changes BufferedPositionedInputStream to wrap a BufferedInputStream around the provided InputStream.  It also adds a new constructor for DefaultTuple (and new calls in TupleFactory) that take an ArrayList<Object> and use that directly to construct the DefaultTuple instead of copying the list (as was done previously).  In a run of the pig mix queries these changes made most queries about 25-40% faster.
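
The two changes described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of loadperf.patch: LoadPerfSketch, PositionedStream and ListTuple are hypothetical stand-ins for BufferedPositionedInputStream and DefaultTuple, and the real classes have more methods than shown here.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class LoadPerfSketch {
    /** Hypothetical stand-in: buffers the source and tracks bytes handed to the caller. */
    static class PositionedStream extends InputStream {
        private final InputStream in;
        private long pos;

        PositionedStream(InputStream source, long startPos) {
            this.in = new BufferedInputStream(source);
            this.pos = startPos;
        }

        @Override
        public int read() throws IOException {
            int c = in.read(); // usually served from the in-memory buffer
            if (c != -1) {
                pos++;
            }
            return c;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = in.read(b, off, len);
            if (n > 0) {
                pos += n;
            }
            return n;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }

        /** Position as seen by the caller, not how far the buffer has read ahead. */
        long getPosition() {
            return pos;
        }
    }

    /** Hypothetical stand-in for a tuple backed by a list of fields. */
    static class ListTuple {
        private final List<Object> fields;

        private ListTuple(List<Object> fields) {
            this.fields = fields;
        }

        /** Copies the list: safe if the caller reuses it, but costs O(n) per tuple. */
        static ListTuple copyOf(List<Object> source) {
            return new ListTuple(new ArrayList<>(source));
        }

        /** Uses the caller's list directly; the caller must not modify it afterwards. */
        static ListTuple adopt(ArrayList<Object> source) {
            return new ListTuple(source);
        }

        int size() {
            return fields.size();
        }

        Object get(int i) {
            return fields.get(i);
        }
    }
}
```

The trade-off in the adopting variant is aliasing: it saves one list copy per tuple, but any later change the caller makes to the list shows up in the tuple.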



[jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-599:
---------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-599:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch checked in quite some time ago.  Closing bug.



[jira] Commented: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673399#action_12673399 ] 

Olga Natkovich commented on PIG-599:
------------------------------------

Alan, has this patch been committed?



[jira] Commented: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662359#action_12662359 ] 

Benjamin Reed commented on PIG-599:
-----------------------------------

+1 looks good.



[jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-599:
---------------------------

    Attachment: loadperf-2.patch

A second version of the patch that addresses Ben's comments.
