Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/05/27 02:59:45 UTC

[jira] Created: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
-----------------------------------------------------------------------------------------

                 Key: PIG-820
                 URL: https://issues.apache.org/jira/browse/PIG-820
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.3.0
            Reporter: Alan Gates
            Assignee: Alan Gates


Currently a sampling job requires that data already be stored in BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For order by this
has mostly been acceptable, because users tend to use order by at the end of their script, where other MR jobs have already operated on the data and thus it
is already stored in BinaryStorage format.  For Pig scripts that do only an order by, however, an entire MR job is required to read the data and write it out
in BinaryStorage format.

As we begin work on join algorithms that will require sampling, this requirement to read the entire input and write it back out will not be acceptable.
Join is often the first operation of a script, and thus is much more likely to trigger this useless up front translation job.

Instead, RandomSampleLoader can be changed to subsume an existing loader, using the user-specified loader to read the tuples while handling the skipping
between tuples itself.  This will require the subsumed loader to implement a SamplableLoader interface, which will look something like:

{code}
public interface SamplableLoader extends LoadFunc {
    
    /**
     * Skip ahead in the input stream.
     * @param n number of bytes to skip
     * @return number of bytes actually skipped.  The return semantics are
     * exactly the same as {@link java.io.InputStream#skip(long)}
     */
    public long skip(long n) throws IOException;
    
    /**
     * Get the current position in the stream.
     * @return position in the stream.
     */
    public long getPosition() throws IOException;
}
{code}
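
A minimal sketch of how a sampler might drive such a loader, under stated assumptions: the interface below is a simplified stand-in that returns String records rather than Pig tuples and does not extend the real LoadFunc, and LineLoader, sample and sampleLines are hypothetical names used only for illustration. The sampler owns the skipping between records; the loader resynchronizes to the next record boundary after a skip.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SamplingSketch {

    /** Simplified stand-in for the proposed interface (assumption: String records, no LoadFunc). */
    interface SamplableLoader {
        String getNext() throws IOException;   // next record, or null at end of input
        long skip(long n) throws IOException;  // same return semantics as InputStream.skip
        long getPosition() throws IOException;
    }

    /** Hypothetical newline-delimited loader, used only for this illustration. */
    static class LineLoader implements SamplableLoader {
        private final InputStream in;
        private long pos = 0;
        private boolean midRecord = false;     // true after a skip: we may be inside a record

        LineLoader(byte[] data) { this.in = new ByteArrayInputStream(data); }

        public String getNext() throws IOException {
            int b;
            if (midRecord) {
                // Discard the (possibly partial) record we landed in.
                while ((b = in.read()) != -1) {
                    pos++;
                    if (b == '\n') break;
                }
                midRecord = false;
            }
            StringBuilder sb = new StringBuilder();
            boolean sawAny = false;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') return sb.toString();
                sb.append((char) b);
                sawAny = true;
            }
            return sawAny ? sb.toString() : null;
        }

        public long skip(long n) throws IOException {
            long skipped = in.skip(n);
            pos += skipped;
            if (skipped > 0) midRecord = true;
            return skipped;
        }

        public long getPosition() { return pos; }
    }

    /** The sampler: read one record, skip ahead, repeat. */
    static List<String> sample(SamplableLoader loader, long skipBytes, int maxSamples)
            throws IOException {
        List<String> samples = new ArrayList<>();
        while (samples.size() < maxSamples) {
            String rec = loader.getNext();
            if (rec == null) break;
            samples.add(rec);
            loader.skip(skipBytes);            // may skip fewer bytes near end of input
        }
        return samples;
    }

    /** Convenience entry point that hides the checked exception. */
    public static List<String> sampleLines(String data, long skipBytes, int maxSamples) {
        try {
            return sample(new LineLoader(data.getBytes(StandardCharsets.UTF_8)), skipBytes, maxSamples);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleLines("a1\nb2\nc3\nd4\ne5\n", 4, 10)); // prints [a1, d4]
    }
}
```

In this sketch a skip that happens to land exactly on a record boundary still discards the following record, which is acceptable for sampling but is a simplification, not a statement about the actual patch.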

The MRCompiler would then check whether the loader being used to load data implements the SamplableLoader interface.  If so, rather than create an initial MR
job to do the translation, it would create the sampling job directly, having RandomSampleLoader use the user-specified loader.
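
The compiler-side decision described above could be sketched roughly as follows. This is only an illustration: the two interfaces are empty stand-ins for the real ones, and planJobs, TextLoader, OpaqueLoader and the job names are hypothetical, not taken from the MRCompiler source.

```java
import java.util.ArrayList;
import java.util.List;

public class PlanSketch {

    /** Empty stand-ins for the real interfaces (assumption). */
    interface LoadFunc {}
    interface SamplableLoader extends LoadFunc {}

    /** Hypothetical loader that supports sampling. */
    static class TextLoader implements SamplableLoader {}

    /** Hypothetical loader that does not. */
    static class OpaqueLoader implements LoadFunc {}

    /** Returns the (hypothetical) list of MR jobs needed for an order by. */
    public static List<String> planJobs(LoadFunc loader) {
        List<String> jobs = new ArrayList<>();
        if (!(loader instanceof SamplableLoader)) {
            // The loader cannot be sampled directly, so the data must first
            // be rewritten in a format the sampler understands.
            jobs.add("translate");
        }
        jobs.add("sample");
        jobs.add("sort");
        return jobs;
    }

    public static void main(String[] args) {
        System.out.println(planJobs(new TextLoader()));   // prints [sample, sort]
        System.out.println(planJobs(new OpaqueLoader())); // prints [translate, sample, sort]
    }
}
```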


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Patch Available  (was: Open)

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Patch Available  (was: Reopened)

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, pig-820_v8.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Open  (was: Patch Available)

Will be uploading a new patch.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Open  (was: Patch Available)

Due to the change in the LoadFunc interface made as part of the PIG-734 commit, my patch no longer applies cleanly on trunk. I will merge with trunk and regenerate the patch.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Open  (was: Patch Available)

Missed test files. 

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v8.patch

Thanks, Pradeep, for the review. skip(1) is not required, because reading a byte (by calling in.read()) already advances the read pointer by 1. I updated the comment in the interface to note that a loader implementing the interface should not assume that the current read position is at the beginning of a tuple.
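
To illustrate the point about in.read(), here is a small sketch that is not taken from the patch: it assumes newline-delimited records and uses a plain ByteArrayInputStream, with hypothetical method names. It counts how many bytes are consumed while scanning for the end of the current record, once with read() alone and once with an extra skip(1) after each read.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ReadAdvanceSketch {

    /** Scan to just past the first '\n' using read() only. */
    public static int consumedWithReadOnly(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed++;                    // read() already moved the pointer by one
            if (b == '\n') break;
        }
        return consumed;
    }

    /** Same scan, but with a redundant skip(1) after every read(). */
    public static int consumedWithExtraSkip(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            if (b == '\n') break;
            consumed += (int) in.skip(1);  // jumps over a byte that was never inspected
        }
        return consumed;
    }

    public static void main(String[] args) {
        byte[] data = "a\ncd\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(consumedWithReadOnly(data));  // prints 2
        System.out.println(consumedWithExtraSkip(data)); // prints 5
    }
}
```

With the extra skip, the first newline is stepped over without being examined, so the scan overshoots the record boundary.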

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, pig-820_v8.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724717#action_12724717 ] 

Pradeep Kamath commented on PIG-820:
------------------------------------

+1

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722706#action_12722706 ] 

Ashutosh Chauhan commented on PIG-820:
--------------------------------------

In the patch, RandomSampleLoader is marked as Serializable and its loader field is marked as transient. Since loader is initialized in the constructor and used later on, findbugs complains: "This class contains a field that is updated at multiple places in the class, thus it seems to be part of the state of the class. However, since the field is marked as transient and not set in readObject or readResolve, it will contain the default value in any deserialized instance of the class." However, there is no need for RandomSampleLoader to implement Serializable (and thus no need for loader to be marked as transient), because the loader is reconstructed from its FuncSpec later on. For the same reason, neither PigStorage nor BinStorage implements Serializable. I will submit a new patch with the required changes.
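
The behaviour findbugs warns about can be reproduced with a small standalone sketch. The class and field names here are hypothetical; Wrapper merely stands in for a Serializable class with a transient field set in its constructor, and is not the RandomSampleLoader code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientSketch {

    /** Hypothetical Serializable class with a transient field set in the constructor. */
    static class Wrapper implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;
        private transient StringBuilder delegate;   // stands in for the wrapped loader

        Wrapper(String name) {
            this.name = name;
            this.delegate = new StringBuilder(name);
        }

        String name() { return name; }
        boolean hasDelegate() { return delegate != null; }
    }

    /** Serialize and deserialize a Wrapper in memory. */
    static Wrapper roundTrip(Wrapper w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(w);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Wrapper) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if the transient field is still set after a round trip. */
    public static boolean delegateSurvivesRoundTrip() {
        return roundTrip(new Wrapper("loader")).hasDelegate();
    }

    /** True if the non-transient field is preserved by a round trip. */
    public static boolean nameSurvivesRoundTrip() {
        return "loader".equals(roundTrip(new Wrapper("loader")).name());
    }

    public static void main(String[] args) {
        System.out.println(nameSurvivesRoundTrip());     // prints true
        System.out.println(delegateSurvivesRoundTrip()); // prints false
    }
}
```

The constructor is not run during deserialization, so the transient field comes back as null unless readObject or readResolve restores it.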


> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: pig-820.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-820:
-------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Patch committed - thanks Ashutosh!

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, pig-820_v8.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728161#action_12728161 ] 

Ashutosh Chauhan commented on PIG-820:
--------------------------------------

The patch includes no unit tests as no new functionality is added or modified.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, pig-820_v8.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v5.patch

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Patch Available  (was: Open)

Resubmitting patch.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v4.patch

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch
>
>


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725602#action_12725602 ] 

Hudson commented on PIG-820:
----------------------------

Integrated in Pig-trunk #490 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/490/])
    Change RandomSampleLoader to take a LoadFunc instead of extending BinStorage.  Added new Samplable interface for loaders to implement, allowing them to be used by RandomSampleLoader.


> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-820:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

v6 of the patch checked in.  Thanks Ashutosh.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>


[jira] Reopened: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan reopened PIG-820:
----------------------------------


The Samplable interface introduced as part of this patch enforces the contract of implementing getPosition() and next() on the loaders implementing it. An additional requirement for a loader to be usable as a sampler is that it should correctly handle getNext() without knowing the position in the file. The current patch doesn't include this contract as part of the interface; it should be part of the interface.
Reopening the jira because of this issue.
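
One way to read the getNext() requirement above: after an arbitrary skip, the loader may be asked for a record while positioned in the middle of one, so the format must allow it to find the next record boundary. The sketch below is illustrative only and uses two made-up encodings (newline-delimited, and a one-byte length prefix); neither is BinStorage or any other Pig format, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: contrasts two record encodings when reading starts mid-record. */
class MidRecordRead {
    /** Delimited format: drop bytes up to the next '\n', then read one whole record. */
    static String nextDelimited(byte[] data, int offset) {
        int i = offset;
        while (i < data.length && data[i] != '\n') i++;   // skip the partial record
        int start = i + 1;
        if (start >= data.length) return null;            // no complete record left
        int end = start;
        while (end < data.length && data[end] != '\n') end++;
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }

    /** Length-prefixed format: one length byte followed by that many payload bytes. */
    static String nextLengthPrefixed(byte[] data, int offset) {
        if (offset >= data.length) return null;
        int len = data[offset] & 0xFF;                    // may really be a payload byte
        int start = offset + 1;
        int end = Math.min(start + len, data.length);
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Starting at offset 2 of "alpha\nbeta\ngamma\n", nextDelimited resynchronizes and returns "beta". The length-prefixed reader has no such anchor: started inside a payload, it treats a data byte as a length and returns bytes that span record boundaries. A format like that would need sync markers or similar before it could satisfy the contract.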

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724769#action_12724769 ] 

Hadoop QA commented on PIG-820:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411945/pig-820_v6.patch
  against trunk revision 788174.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/104/console

This message is automatically generated.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v3.patch

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724204#action_12724204 ] 

Pradeep Kamath commented on PIG-820:
------------------------------------

In SampleOptimizer the following should change:
{code}
        // First argument is name of loader function to subsume, this we want to set for
        // ourselves.
        rslargs[0] = predFs.getFuncName();
{code}
to
{code}
        // First argument is name of loader function with constructor args to subsume, this we want to set for
        // ourselves.
        rslargs[0] = predFs.getFuncSpec().toString();
{code}
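
The reason for the change above is presumably that the bare function name drops any constructor arguments the user gave the loader (a field delimiter, for instance), so the subsumed loader would be rebuilt with defaults. A small sketch of that difference follows; LoaderSpec is a hypothetical stand-in written for this note, not Pig's FuncSpec, and the string format it prints is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a loader spec; not Pig's FuncSpec class. */
class LoaderSpec {
    final String className;
    final List<String> ctorArgs;

    LoaderSpec(String className, String... ctorArgs) {
        this.className = className;
        this.ctorArgs = Arrays.asList(ctorArgs);
    }

    /** Class name only: constructor arguments are lost. */
    String getFuncName() { return className; }

    /** Class name plus quoted constructor arguments, e.g. Foo('a','b'). */
    @Override
    public String toString() {
        if (ctorArgs.isEmpty()) return className;
        return className + ctorArgs.stream()
                .map(a -> "'" + a + "'")
                .collect(Collectors.joining(",", "(", ")"));
    }
}
```

For new LoaderSpec("PigStorage", ","), getFuncName() gives "PigStorage" while toString() gives "PigStorage(',')"; only the second carries enough information to reconstruct a loader that splits on commas.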

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch
>
>
> Currently a sampling job requires that data already be stored in BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For order by this
> has mostly been acceptable, because users tend to use order by at the end of their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this requirement to read the entire input and write it back out will not be acceptable.
> Join is often the first operation of a script, and thus is much more likely to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>     
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped.  The return semantics are
>      * exactly the same as {@link java.io.InpuStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>     
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data implemented the SamplableLoader interface.  If so, rather than create an initial MR
> job to do the translation it would create the sampling job, having RandomSampleLoader use the user specified loader.
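
The skip() contract quoted above defers to java.io.InputStream#skip(long), which may skip fewer bytes than requested even before end of stream. A minimal sketch of a skip()/getPosition() pair with those semantics follows, assuming the loader reads from a plain InputStream. SkippingReader is a hypothetical name used only for illustration; it is not a Pig class and not the code in the attached patches.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustration only: a hypothetical wrapper, not a Pig class.
final class SkippingReader {
    private final InputStream in;
    private long position; // bytes skipped so far; a real loader would also count bytes it reads

    SkippingReader(InputStream in) {
        this.in = in;
    }

    /** Skips up to n bytes and returns the number of bytes actually skipped. */
    long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n) {
            long s = in.skip(n - skipped);
            if (s > 0) {
                skipped += s;
                continue;
            }
            // skip() made no progress: read one byte to tell a short skip from EOF.
            if (in.read() == -1) {
                break; // end of stream, report what was skipped so far
            }
            skipped++; // the byte just read counts as skipped
        }
        position += skipped;
        return skipped;
    }

    /** Returns the current position, counted from where this reader started. */
    long getPosition() {
        return position;
    }
}
```

A caller can compare the return value with n to detect that the end of the input was reached before n bytes could be skipped.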

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Patch Available  (was: Open)

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723316#action_12723316 ] 

Ashutosh Chauhan commented on PIG-820:
--------------------------------------

Thanks Alan and Pradeep for the review.

Will be incorporating SampleOptimizer changes. 
The constructor of RandomSampleLoader can only take String arguments, since it is instantiated from a FuncSpec on the backend, so the types of its constructor arguments can't be changed. However, instead of a String holding the loader's class name, the String form of the loader's FuncSpec can be passed, so that the loader is instantiated with the correct constructor.
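
To illustrate why a spec string carries more than a bare class name, here is a minimal sketch that parses a string of the assumed form ClassName('arg1','arg2') and calls the matching all-String constructor by reflection. SpecSketch and its parsing rules are hypothetical and written only for this example; this is not Pig's FuncSpec parser.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Pig code: builds an object from "pkg.Class('a','b')".
final class SpecSketch {
    static Object instantiate(String spec) throws ReflectiveOperationException {
        int open = spec.indexOf('(');
        String className = (open < 0 ? spec : spec.substring(0, open)).trim();

        // Collect the single-quoted arguments; quotes inside an argument are not supported.
        List<String> args = new ArrayList<>();
        int i = open + 1;
        while (open >= 0) {
            int start = spec.indexOf('\'', i);
            if (start < 0) {
                break;
            }
            int end = spec.indexOf('\'', start + 1);
            if (end < 0) {
                throw new IllegalArgumentException("unterminated quote: " + spec);
            }
            args.add(spec.substring(start + 1, end));
            i = end + 1;
        }

        // Every constructor parameter is a String, matching the constraint described above.
        Class<?>[] types = new Class<?>[args.size()];
        Arrays.fill(types, String.class);
        Constructor<?> ctor = Class.forName(className).getConstructor(types);
        return ctor.newInstance(args.toArray());
    }
}
```

For example, SpecSketch.instantiate("java.io.File('dir','name')") calls the two-String File constructor, whereas a bare class name can only reach the no-argument constructor.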

Will be uploading a new patch soon.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727834#action_12727834 ] 

Pradeep Kamath commented on PIG-820:
------------------------------------

Review comments - two observations:

1. In PigStorage, the skip() implementation should do an extra skip(1) if the read at offset n-1 does not return -1. That is, after skipping n-1 bytes, if the stream is not at EOF, there should be one more skip(1), so that n bytes are skipped in all.

2. The comment for getSampledTuple() contains:
{code}
   /** 
	* Get the next sampled tuple from the stream. 
	* Those loaders which can appropriately return the next tuple after 
	* skipping in the stream(e.g. BinStorage) can in turn call their getNext()
	* for implementing this method. Those who cannot (e.g. PigStorage) need to
	* provide their own implementation.
	* Samplers must call this method to get next tuple and should never directly call
	* underlying loader's getNext() method.
	* @return the next tuple after skipping or null if there are no more tuples
	* to be processed.
	*/

{code}

The comment can be updated to be explicit about the context in which getSampledTuple() would be called, something along the lines of:
{noformat}
getSampledTuple() method will be called after a call to skip(). Hence the loader implementation would have to handle the case wherein the current read position 
in the stream is not at the beginning of a record and correctly give the next tuple starting from the current read position. In particular, the implementation would need to handle the following cases:
1) The current read position for the input stream is at the beginning of the stream - in this case getSampledTuple() should return the tuple representing the first record in the stream
2) The current read position for the input stream is in the middle of a record - in this case getSampledTuple() should return the tuple representing the next record by reading forward in the stream
3) The current read position for the input stream is exactly at the beginning of a record - in this case getSampledTuple() should return the tuple representing the record at current read position
4) The current read position for the input stream is beyond end of file - in this case getSampledTuple() should return null
{noformat}

To keep the comment from being very verbose, the implementation details (whether to delegate to getNext() or not) can be omitted.
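
The four cases listed above can be sketched for newline-delimited records as follows. This is an illustration under simplifying assumptions: a byte array stands in for the input stream, '\n' terminates each record, and a String stands in for a tuple. LineSampler is a hypothetical class, not PigStorage and not the code in the patch.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: a byte array stands in for the input stream, '\n' ends each record.
final class LineSampler {
    private final byte[] data;
    private int pos; // current read position

    LineSampler(byte[] data) {
        this.data = data;
    }

    /** Advances by up to n bytes and returns how many bytes were actually skipped. */
    long skip(long n) {
        long s = Math.max(0, Math.min(n, (long) data.length - pos));
        pos += (int) s;
        return s;
    }

    /** Returns the next complete record at or after the read position, or null if none. */
    String getSampledTuple() {
        if (pos >= data.length) {
            return null; // case 4: at or beyond end of input
        }
        // Cases 1 and 3: already at the first byte of a record.
        boolean atRecordStart = pos == 0 || data[pos - 1] == '\n';
        if (!atRecordStart) {
            // Case 2: discard the rest of the partial record, including its delimiter.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            if (pos < data.length) {
                pos++;
            }
            if (pos >= data.length) {
                return null; // the partial record was the last one
            }
        }
        int start = pos;
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        String record = new String(data, start, pos - start, StandardCharsets.UTF_8);
        if (pos < data.length) {
            pos++; // consume the delimiter so the next call starts on a record boundary
        }
        return record;
    }
}
```

With a real stream the loader cannot look back at the previous byte as this sketch does, so it would have to remember whether the last byte it skipped was a delimiter.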


> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v6.patch

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Patch Available  (was: Open)

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727907#action_12727907 ] 

Hadoop QA commented on PIG-820:
-------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12412695/pig-820_v8.patch
  against trunk revision 791048.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/114/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/114/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/114/console

This message is automatically generated.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, pig-820_v8.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v2.patch

Patch which fixes findbugs warning.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch
>
>

[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722286#action_12722286 ] 

Hadoop QA commented on PIG-820:
-------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411325/pig-820.patch
  against trunk revision 786694.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/console

This message is automatically generated.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: pig-820.patch
>
>

[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Open  (was: Patch Available)

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723204#action_12723204 ] 

Alan Gates commented on PIG-820:
--------------------------------

+1

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723240#action_12723240 ] 

Pradeep Kamath commented on PIG-820:
------------------------------------

Some review comments:
In SampleOptimizer.java, 
{noformat}
      LoadFunc lf = (LoadFunc)PigContext.instantiateFuncFromSpec(predLoad.getLFile().getFuncName());
 should be changed to 
      LoadFunc lf = (LoadFunc)PigContext.instantiateFuncFromSpec(predLoad.getLFile().getFuncSpec());
{noformat}
This ensures that we correctly handle loaders that do not have a default constructor. FuncSpec encapsulates both the class name and the constructor arguments, so it handles both loaders that have only a default constructor and loaders that have only a constructor with arguments.

Similarly,
{noformat}
      fs = new FileSpec(predFs.getFileName(), new FuncSpec(predFs.getFuncName()));
should be changed to
      fs = new FileSpec(predFs.getFileName(), predFs.getFuncSpec());
{noformat}

Likewise, the constructor of RandomSampleLoader should take a FuncSpec object as its first argument to represent the loader class name and constructor arguments. Callers that create a RandomSampleLoader will therefore need to pass it the correct FuncSpec object.
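
To illustrate why the spec has to carry constructor arguments, here is a minimal, self-contained sketch. It does not use Pig's real classes: Spec is a hypothetical stand-in for FuncSpec, DelimitedLoader is a made-up loader with no default constructor, and instantiate() only assumes that a spec-based factory looks up a constructor matching the arguments by reflection.

```java
import java.util.Arrays;

public class FuncSpecSketch {

    /** Hypothetical stand-in for FuncSpec: a class name plus constructor arguments. */
    static final class Spec {
        final String className;
        final String[] ctorArgs;

        Spec(String className, String... ctorArgs) {
            this.className = className;
            this.ctorArgs = ctorArgs;
        }
    }

    /** Made-up loader that has only a one-argument constructor. */
    public static class DelimitedLoader {
        final String delimiter;

        public DelimitedLoader(String delimiter) {
            this.delimiter = delimiter;
        }
    }

    /** Instantiate the class named by the spec, passing its String arguments. */
    static Object instantiate(Spec spec) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(spec.className);
        if (spec.ctorArgs.length == 0) {
            // Name-only spec: this requires a default constructor.
            return cls.getDeclaredConstructor().newInstance();
        }
        Class<?>[] types = new Class<?>[spec.ctorArgs.length];
        Arrays.fill(types, String.class);
        return cls.getConstructor(types).newInstance((Object[]) spec.ctorArgs);
    }

    public static void main(String[] args) throws Exception {
        String name = DelimitedLoader.class.getName();

        // Full spec (name + args): works for a loader without a default constructor.
        DelimitedLoader ok = (DelimitedLoader) instantiate(new Spec(name, ","));
        System.out.println("delimiter = " + ok.delimiter);

        // Name-only spec: the constructor arguments are lost, so lookup fails.
        try {
            instantiate(new Spec(name));
        } catch (NoSuchMethodException e) {
            System.out.println("name-only spec failed: no default constructor");
        }
    }
}
```

Rebuilding a spec from the class name alone, as in new FuncSpec(predFs.getFuncName()), corresponds to the second case in this sketch.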





> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Fix Version/s: 0.4.0
         Assignee: Ashutosh Chauhan  (was: Alan Gates)
           Status: Open  (was: Patch Available)

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820_v7.patch

Submitting the patch for review. Currently running tests. Will update the jira with the result.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Affects Version/s: 0.4.0
               Status: Patch Available  (was: Open)

Submitting for both 0.3 and 0.4 branches.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: pig-820.patch
>
>


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724632#action_12724632 ] 

Hadoop QA commented on PIG-820:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411723/pig-820_v5.patch
  against trunk revision 788174.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/102/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/102/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/102/console

This message is automatically generated.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Attachment: pig-820.patch

In addition to the explanation above, this patch introduces SampleOptimizer, which visits the compiled MR plan to detect the following pattern: an MR operator containing only a load-store, followed by an MR operator whose map plan contains a sampling job. If the pattern is present, SampleOptimizer deletes the unnecessary predecessor MR operator and replaces the POLoad of the sampling job with a RandomSampleLoader that uses the loader of its predecessor.
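
The rewrite described above can be sketched roughly as follows. This is a heavily simplified model and not the code in the patch: Job is a hypothetical stand-in for an MR operator, the plan is assumed to be a simple linear list, and the loader is represented as a plain string. A real plan is a graph and would need further bookkeeping that the sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

public class SampleOptimizerSketch {

    /** Hypothetical, simplified stand-in for a compiled MR operator. */
    static final class Job {
        final String name;
        String input;                 // file the job's load reads from
        String loader;                // loader used by the job's load
        final boolean loadStoreOnly;  // map plan is just load followed by store
        final boolean sampling;       // map plan contains the sampling job

        Job(String name, String input, String loader,
            boolean loadStoreOnly, boolean sampling) {
            this.name = name;
            this.input = input;
            this.loader = loader;
            this.loadStoreOnly = loadStoreOnly;
            this.sampling = sampling;
        }
    }

    /** Drop a load-store-only job that is directly followed by a sampling job. */
    static List<Job> optimize(List<Job> plan) {
        List<Job> out = new ArrayList<>();
        int i = 0;
        while (i < plan.size()) {
            Job cur = plan.get(i);
            Job next = (i + 1 < plan.size()) ? plan.get(i + 1) : null;
            if (cur.loadStoreOnly && next != null && next.sampling) {
                // The sampling job now reads the original input directly,
                // through a sampling loader that wraps the predecessor's loader.
                next.input = cur.input;
                next.loader = "RandomSampleLoader(" + cur.loader + ")";
                i++;        // skip cur; next is emitted on the following pass
                continue;
            }
            out.add(cur);
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Job> plan = new ArrayList<>();
        plan.add(new Job("translate", "in.txt", "PigStorage", true, false));
        plan.add(new Job("sample", "tmp-1", "BinaryStorage", false, true));
        for (Job j : optimize(plan)) {
            System.out.println(j.name + " reads " + j.input + " via " + j.loader);
        }
    }
}
```

Per the issue description, the rewrite would only be valid when the predecessor's loader implements SamplableLoader; the sketch leaves that check out for brevity.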

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: pig-820.patch
>
>


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-820:
---------------------------------

    Status: Patch Available  (was: Open)

Submitting to Hudson.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch
>
>
