Posted to dev@pig.apache.org by "Sriranjan Manjunath (JIRA)" <ji...@apache.org> on 2009/12/10 01:11:18 UTC

[jira] Created: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Poisson Sample Loader should compute the number of samples required only once
-----------------------------------------------------------------------------

                 Key: PIG-1143
                 URL: https://issues.apache.org/jira/browse/PIG-1143
             Project: Pig
          Issue Type: Bug
            Reporter: Sriranjan Manjunath
            Assignee: Sriranjan Manjunath


The current Poisson sampler forces every map task to compute the sample count. This is redundant and causes problems when a large directory is specified in the join. The sampler should compute the sample count only once and share it with the remaining mappers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1143:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed to the 0.6.0 branch. Thanks, Sri!



[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1143:
--------------------------------

    Fix Version/s: 0.6.0



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791872#action_12791872 ] 

Hadoop QA commented on PIG-1143:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428266/PIG_1143.patch.1
  against trunk revision 891499.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/console

This message is automatically generated.



[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1143:
-------------------------------------

    Attachment: PIG_1143.patch



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789015#action_12789015 ] 

Thejas M Nair commented on PIG-1143:
------------------------------------

To summarize my comment above: the approach in the load-store redesign of not using the file size at all is better.



[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1143:
--------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793322#action_12793322 ] 

Olga Natkovich commented on PIG-1143:
-------------------------------------

Patch committed to trunk. Will commit to the 0.6 branch tomorrow.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971 ] 

Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

To describe the problem in more detail: the current implementation does not handle a glob efficiently. When the sample loader encounters a directory (or a combination of directories), it fetches the element descriptors of all the files inside it to compute the file sizes.
For example, A = load "{view, click}" results in computing the sizes of all the files underneath both the "view" and "click" directories. With a large number of mappers, this produces a flood of HDFS calls that clogs the name node.

I intend to modify Poisson Sample Loader as follows. The algorithm for computing the total number of samples remains the same, but it will no longer be run by every mapper. I will use the UDFContext object to share this information across mappers. Since mappers and reducers can only read information from UDFContext, the slicer will store it. As before, PigSlice will call computeSamples() for the first map; the slicer will then store the result as a property in the UDFContext object. After that, the slicer will check UDFContext to see whether this value is set and, if it is, use it instead of computing it again. I intend to use "pig.input.0.sampleCount" as the key.

This solution will reduce the fileSize() invocations to a minimum and should reduce the burden on the name node.
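
The compute-once idea described above can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: java.util.Properties stands in for the property bag that UDFContext would provide, and the class name SampleCountCache and the method getOrCompute are hypothetical. Only the key "pig.input.0.sampleCount" comes from the proposal.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

/** Sketch only: Properties is a stand-in for the UDFContext property bag. */
final class SampleCountCache {
    // Key proposed in the comment above.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    /**
     * Returns the stored sample count if the key is already set; otherwise
     * runs the (expensive) computation once and stores the result.
     */
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        int[] calls = {0};
        // Hypothetical expensive computation (e.g. one that lists file sizes).
        LongSupplier expensive = () -> { calls[0]++; return 1234L; };
        long first = getOrCompute(props, expensive);
        long second = getOrCompute(props, expensive);
        System.out.println(first + " " + second + " computed " + calls[0] + " time(s)");
    }
}
```

The second call never touches the supplier, which is the property that matters here: the file-size lookups happen once rather than once per mapper.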



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789063#action_12789063 ] 

Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

I am OK with using InputSplit.getLength() as long as it provides a good estimate of the file size. Without the population size, Poisson samplers do not work well.

Samplers expect the data to be in BinStorage. If it is not, the first job reads it and stores it into BinStorage. The only exception is when the join follows a load/store-only MR job.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790566#action_12790566 ] 

Hadoop QA commented on PIG-1143:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427980/PIG_1143.patch
  against trunk revision 890553.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/console

This message is automatically generated.



[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1143:
-------------------------------------

    Attachment: PIG_1143.patch.1

I have added the successive join and multiquery unit tests.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788998#action_12788998 ] 

Olga Natkovich commented on PIG-1143:
-------------------------------------

Sounds like a good approach. We need to figure out how this will translate into the Load-Store redesign and make sure to port it there once the patch is available.



[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1143:
-------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789069#action_12789069 ] 

Thejas M Nair commented on PIG-1143:
------------------------------------

If the data is going to be in BinStorage, my comments regarding the approach for this patch are not applicable. But the patch does not need to be ported to the load-store redesign branch.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790900#action_12790900 ] 

Olga Natkovich commented on PIG-1143:
-------------------------------------

I am reviewing this patch.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789013#action_12789013 ] 

Thejas M Nair commented on PIG-1143:
------------------------------------

The PoissonSampleLoader implementation in the load-store redesign does not check the file size and takes a different approach, for the following reason (as mentioned in PIG-1062):

With the new interfaces in the load-store redesign, Pig can compute the file size by adding up the size of each split (from InputSplit.getLength()). But the documentation of that function does not make clear whether this is the size on disk, compressed or uncompressed, etc. It looks like it just needs to be some number proportional to the size of the file. Assuming it is the size on disk (uncompressed), using it to estimate the total memory required is tricky: one has to make assumptions about the compression ratio and the serialization method.
Using Tuple.getMemorySize() while sampling will give more accurate numbers for the memory the reducer will consume.
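
The two estimates contrasted above can be sketched side by side. This is a simplified illustration under stated assumptions, not Pig code: plain lists of long values stand in for the per-split lengths (what InputSplit.getLength() would return) and for the per-tuple sizes (what Tuple.getMemorySize() would return on sampled tuples). The class and method names are hypothetical.

```java
import java.util.List;

/** Sketch only: longs stand in for split lengths and sampled tuple sizes. */
final class InputSizeEstimate {

    /** Sums per-split lengths to approximate the total input size in bytes. */
    static long totalBytes(List<Long> splitLengths) {
        long total = 0;
        for (long len : splitLengths) {
            total += len;
        }
        return total;
    }

    /**
     * Average in-memory size of the sampled tuples, in bytes (integer division).
     * Returns 0 when no tuples were sampled.
     */
    static long averageTupleMemory(List<Long> sampledTupleSizes) {
        if (sampledTupleSizes.isEmpty()) {
            return 0;
        }
        long sum = 0;
        for (long size : sampledTupleSizes) {
            sum += size;
        }
        return sum / sampledTupleSizes.size();
    }

    public static void main(String[] args) {
        System.out.println(totalBytes(List.of(100L, 200L, 300L)));
        System.out.println(averageTupleMemory(List.of(10L, 20L, 30L)));
    }
}
```

The first number depends on how the data is stored on disk (compression, serialization); the second is measured on deserialized tuples, which is why it tracks reducer memory more directly.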




[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-1143:
-------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789023#action_12789023 ] 

Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

The file size in the documentation refers to the size on disk. To account for compression, encoding, etc., a configurable parameter, pig.inputfile.conversionfactor, is provided. I agree that this cannot be set to a good value for compressed data; it is only guidance. The implications of setting it to a bad value are minimal: we will end up sampling a little more than the required number of samples (unless you set it to a fraction).
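
As a rough illustration of how such a factor might be applied: the sketch below assumes the factor simply multiplies the on-disk size, which this thread does not confirm. Only the property name pig.inputfile.conversionfactor comes from the comment above; the class name, the method name, and the default of 1.0 are hypothetical.

```java
import java.util.Properties;

/** Sketch only: assumes the factor scales on-disk bytes linearly. */
final class ConversionFactorSketch {
    static final String FACTOR_KEY = "pig.inputfile.conversionfactor";

    /**
     * Scales the on-disk size by the configured factor.
     * The fallback of 1.0 is an assumption made for this sketch.
     */
    static long estimateInMemoryBytes(long onDiskBytes, Properties conf) {
        double factor = Double.parseDouble(conf.getProperty(FACTOR_KEY, "1.0"));
        return (long) (onDiskBytes * factor);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FACTOR_KEY, "10");
        System.out.println(estimateInMemoryBytes(1000L, conf));
    }
}
```

Under this assumption, a factor that is too large inflates the estimate (and so the sample count), while a fractional factor shrinks it, which matches the caveat about fractions above.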



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790983#action_12790983 ] 

Olga Natkovich commented on PIG-1143:
-------------------------------------

I think this needs to be tested with multiple skew joins, both in the single-store case and with multiquery. Please add unit tests.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789064#action_12789064 ] 

Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

*now* should have been *not*!

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793296#action_12793296 ] 

Olga Natkovich commented on PIG-1143:
-------------------------------------

+1 on the code changes. There is an extra debug trace in the code that I will remove as part of the commit.

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>         Attachments: PIG_1143.patch, PIG_1143.patch.1
>
>
> The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789046#action_12789046 ] 

Thejas M Nair commented on PIG-1143:
------------------------------------

Pig input does not have to be a file; the LoadFunc could be reading from HBase or some other source. So the use of FileLocalizer.getSize(fname,pcProps) will not work in all cases.
InputSplit.getLength() can be used instead, but as per the documentation, the purpose of InputSplit.getLength() is "so that the input splits can be sorted by size". So implementations might just return a number that is proportional to the size if they don't have access to the actual size.

Even if the actual file size on disk is available through InputSplit.getLength(), in the case of columnar storage the compression ratio can be very high (e.g. run-length encoding of a column that is the sort key with only a few unique values), and we might end up sampling very little.
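
To make the under-sampling concern concrete, here is a minimal sketch in plain Java. It is not the Pig sampler's code: the class name, the method names, the 100-bytes-per-row assumption, the one-sample-per-N-rows rule, and all the numbers are hypothetical and chosen only for illustration. It compares the sample count derived from a split's reported byte length with the count derived from the true number of rows when the data is heavily compressed.

```java
public class SampleCountSketch {

    // Hypothetical estimate: rows = reported bytes / assumed bytes per row.
    static long estimateRows(long reportedBytes, long assumedBytesPerRow) {
        return reportedBytes / assumedBytesPerRow;
    }

    // Hypothetical rule: take one sample for every rowsPerSample rows, rounding up.
    static long samplesNeeded(long rows, long rowsPerSample) {
        return (rows + rowsPerSample - 1) / rowsPerSample;
    }

    public static void main(String[] args) {
        long reportedBytes = 64L * 1024 * 1024; // length reported for one split (assumed)
        long assumedBytesPerRow = 100;          // the sampler's guess at row width (assumed)
        long compressionRatio = 20;             // e.g. a run-length encoded sort key (assumed)
        long rowsPerSample = 1000;              // illustrative sampling rate

        // Row count the sampler would infer from the on-disk size.
        long estimatedRows = estimateRows(reportedBytes, assumedBytesPerRow);
        // Row count actually present once the data is decompressed.
        long actualRows = estimatedRows * compressionRatio;

        System.out.println("samples from size estimate: "
                + samplesNeeded(estimatedRows, rowsPerSample));
        System.out.println("samples for actual rows:    "
                + samplesNeeded(actualRows, rowsPerSample));
    }
}
```

Under these assumptions the size-based count falls short of the row-based count by roughly the compression ratio, which is the "sampling very little" effect described above.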

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.

-- 