Posted to dev@pig.apache.org by "Aniket Mokashi (JIRA)" <ji...@apache.org> on 2014/01/06 23:13:51 UTC

[jira] [Commented] (PIG-3648) Make the sample size for RandomSampleLoader configurable

    [ https://issues.apache.org/jira/browse/PIG-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863497#comment-13863497 ] 

Aniket Mokashi commented on PIG-3648:
-------------------------------------

+1.

There is a typo in the comments (RandomeSampleLoader); otherwise, the patch looks good.

> volatility in the variance of the results increases when sorting a large number of rows
Can you give an example of when this happens? The sampling algorithm looks good to me. Also, we want to keep the number of samples small, so that we can replace this mechanism in the future if needed.
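
For readers following the thread, here is a minimal sketch of fixed-size reservoir sampling, the general kind of technique a per-task sampler with a bounded sample count can use. It is not taken from the patch or from Pig's source; the class name ReservoirSketch and the method sample are hypothetical and only illustrate how the sample size bounds the number of rows kept no matter how many rows a task reads.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, not Pig code: keep at most sampleSize rows
// from a stream of unknown length, each row retained with equal probability.
final class ReservoirSketch {
    static <T> List<T> sample(Iterator<T> rows, int sampleSize, Random rng) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        List<T> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        while (rows.hasNext()) {
            T row = rows.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill the reservoir with the first sampleSize rows.
                reservoir.add(row);
            } else {
                // Draw an index in [0, seen); replace a slot only if the
                // index falls inside the reservoir.
                long j = (long) (rng.nextDouble() * seen);
                if (j < sampleSize) {
                    reservoir.set((int) j, row);
                }
            }
        }
        return reservoir;
    }
}
```

With a sample size of 100 and 10M+ rows per task, each row survives with probability of roughly 100 / 10,000,000, which is the situation the issue description is concerned about.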

> Make the sample size for RandomSampleLoader configurable
> --------------------------------------------------------
>
>                 Key: PIG-3648
>                 URL: https://issues.apache.org/jira/browse/PIG-3648
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>            Priority: Minor
>             Fix For: 0.13.0
>
>         Attachments: PIG-3648-1.patch
>
>
> Pig uses RandomSampleLoader for range partitioning in order-by. But since the sample size is hardcoded as 100, volatility in the variance of the results increases when sorting a large number of rows (e.g. 10M+ per task).
> It would be nice if the sample size could be configurable via Pig properties.
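
The request above could be sketched roughly as follows: read the sample size from a properties object and fall back to the current hardcoded value of 100 when the property is unset or invalid. This is only an illustration under assumptions, not the attached patch. The class SampleSizeConfig and the property key used here are hypothetical names chosen for the example.

```java
import java.util.Properties;

// Hypothetical illustration, not the attached patch.
final class SampleSizeConfig {
    // Assumed property key, for illustration only.
    static final String SAMPLE_SIZE_KEY = "pig.random.sampler.sample.size";
    // The value the issue describes as currently hardcoded.
    static final int DEFAULT_SAMPLE_SIZE = 100;

    static int sampleSize(Properties props) {
        String raw = props.getProperty(SAMPLE_SIZE_KEY);
        if (raw == null) {
            return DEFAULT_SAMPLE_SIZE;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            // Ignore zero or negative values and keep the default.
            return n > 0 ? n : DEFAULT_SAMPLE_SIZE;
        } catch (NumberFormatException e) {
            // Malformed value: fall back rather than fail the job.
            return DEFAULT_SAMPLE_SIZE;
        }
    }
}
```

Falling back silently on a bad value is one design choice; failing fast with a clear error would be equally defensible.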



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)