Posted to dev@pig.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2009/06/21 03:46:07 UTC
[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader
should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722286#action_12722286 ]
Hadoop QA commented on PIG-820:
-------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12411325/pig-820.patch
against trunk revision 786694.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/96/console
This message is automatically generated.
> PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
> Key: PIG-820
> URL: https://issues.apache.org/jira/browse/PIG-820
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.3.0, 0.4.0
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: pig-820.patch
>
>
> Currently a sampling job requires that data already be stored in BinaryStorage format, since RandomSampleLoader extends BinaryStorage. For order by this
> has mostly been acceptable, because users tend to use order by at the end of their script, where other MR jobs have already operated on the data and thus it
> is already stored in BinaryStorage. For Pig scripts that do only an order by, however, an entire extra MR job is required to read the data and write it
> back out in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this requirement to read the entire input and write it back out will not be acceptable.
> Join is often the first operation of a script, and thus is much more likely to trigger this useless up-front translation job.
> Instead, RandomSampleLoader can be changed to subsume an existing loader, using the user-specified loader to read the tuples while handling the skipping
> between tuples itself. This will require the subsumed loader to implement a SamplableLoader interface that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped. The return semantics are
>      * exactly the same as {@link java.io.InputStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
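A minimal sketch of how a sampling loader could drive a subsumed loader through this interface, for illustration only. Everything except the skip and getPosition signatures is an assumption: LoadFunc is reduced to a one-method stand-in (not Pig's real interface), LineLoader is a hypothetical newline-delimited loader, and SkippingSampler is a hypothetical name, not the code in pig-820.patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-in for Pig's LoadFunc, reduced to the one method this sketch needs.
interface LoadFunc {
    /** Returns the next record, or null at end of input. */
    String getNext() throws IOException;
}

// Same two methods as the interface proposed in the issue.
interface SamplableLoader extends LoadFunc {
    long skip(long n) throws IOException;
    long getPosition() throws IOException;
}

// Hypothetical subsumed loader: newline-delimited records over a stream.
final class LineLoader implements SamplableLoader {
    private final InputStream in;
    private long position = 0;

    LineLoader(InputStream in) { this.in = in; }

    @Override
    public String getNext() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            position++;
            sawAny = true;
            if (b == '\n') break;
            sb.append((char) b);
        }
        return sawAny ? sb.toString() : null;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);   // may be fewer than n bytes
        position += skipped;
        return skipped;
    }

    @Override
    public long getPosition() { return position; }
}

// Hypothetical sampler: the wrapped loader parses records, the sampler
// decides how far to jump between them.
final class SkippingSampler {
    static List<String> sample(SamplableLoader loader, long skipBytes, int maxSamples)
            throws IOException {
        List<String> samples = new ArrayList<>();
        String record;
        while (samples.size() < maxSamples && (record = loader.getNext()) != null) {
            samples.add(record);
            // Like InputStream.skip, skip may return less than requested,
            // so keep calling until the interval is covered or input ends.
            long remaining = skipBytes;
            while (remaining > 0) {
                long skipped = loader.skip(remaining);
                if (skipped <= 0) break;
                remaining -= skipped;
            }
            // The skip most likely landed mid-record; drop the fragment.
            loader.getNext();
        }
        return samples;
    }

    // Usage example over eight 3-byte records; returns [r0, r3, r6].
    static List<String> demo() {
        byte[] data = "r0\nr1\nr2\nr3\nr4\nr5\nr6\nr7\n".getBytes(StandardCharsets.US_ASCII);
        try {
            return sample(new LineLoader(new ByteArrayInputStream(data)), 4, 10);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Discarding the record fragment after each skip is a choice made for this sketch; a real loader might resynchronize to a record boundary inside skip itself. getPosition is implemented but not needed by this simple loop.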
> The MRCompiler would then check whether the loader being used to load data implements the SamplableLoader interface. If so, rather than create an initial MR
> job to do the translation, it would create the sampling job directly, having RandomSampleLoader use the user-specified loader.
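The check described above could look roughly like the following sketch. It is not MRCompiler code: LoadFunc here is an empty stand-in for Pig's interface, and SamplingPlan and SamplePlanner are hypothetical names introduced only to show the decision.

```java
import java.io.IOException;

// Empty stand-in for Pig's LoadFunc; the real interface has more methods.
interface LoadFunc { }

// Same two methods as the interface proposed in the issue.
interface SamplableLoader extends LoadFunc {
    long skip(long n) throws IOException;
    long getPosition() throws IOException;
}

// Hypothetical labels for the two ways the compiler could proceed.
enum SamplingPlan {
    TRANSLATE_THEN_SAMPLE,  // extra MR job rewrites input as BinaryStorage first
    SAMPLE_DIRECTLY         // RandomSampleLoader wraps the user's loader
}

final class SamplePlanner {
    // Skip the translation job only when the loader can be sampled in place.
    static SamplingPlan planFor(LoadFunc loader) {
        return (loader instanceof SamplableLoader)
                ? SamplingPlan.SAMPLE_DIRECTLY
                : SamplingPlan.TRANSLATE_THEN_SAMPLE;
    }
}
```

Loaders that do not implement the interface keep the existing behaviour, so the change would be opt-in per loader.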
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.