Posted to dev@pig.apache.org by "Hao Zhu (JIRA)" <ji...@apache.org> on 2015/03/28 00:15:52 UTC

[jira] [Commented] (PIG-4485) Can Pig disable RandomSampleLoader when doing "Order by"

    [ https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384875#comment-14384875 ] 

Hao Zhu commented on PIG-4485:
------------------------------

BTW, even if we run "set pig.random.sampler.sample.size 0" in the Pig Grunt shell, Pig still launches the sampler job, with 0 rows sampled:

{code}
Input(s):
Successfully sampled 0 records from: "/user/xxx/xxx/parquet/xxx/xxx.parquet"
Successfully read 1111 records from: "/user/xxx/xxx/parquet/xxx/xxx.parquet"
{code}

So I do not understand why we can't disable it.
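
For context on question 1 in the description below: as far as I understand, a total-order sort spread over several reducers needs range boundaries for the partitioner, and sampling the sort key is the usual way to pick them. The following is a minimal Python sketch of that general idea, not Pig's actual code; all function names and parameters (pick_boundaries, partition_for, range_partitioned_sort, sample_size, seed) are hypothetical.

```python
import bisect
import random


def pick_boundaries(sample, num_partitions):
    """Choose num_partitions - 1 cut points from a sample of sort keys."""
    ordered = sorted(sample)
    if not ordered or num_partitions <= 1:
        # Empty sample (or a single partition): no cut points at all.
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the range partition that the key falls into."""
    return bisect.bisect_right(boundaries, key)


def range_partitioned_sort(keys, num_partitions, sample_size, seed=0):
    """Sample keys, derive ranges, then sort each range independently."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    boundaries = pick_boundaries(sample, num_partitions)

    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, boundaries)].append(key)

    # Each hypothetical "reducer" sorts its own range; concatenating the
    # partitions in index order yields a total order.
    return [sorted(p) for p in partitions]
```

In this sketch a sample_size of 0 still gives a correctly sorted result, but every key lands in the first partition, so the sort is no longer spread across partitions. Whether Pig behaves the same way with a zero-row sample is an assumption I have not verified.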


> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
>                 Key: PIG-4485
>                 URL: https://issues.apache.org/jira/browse/PIG-4485
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Hao Zhu
>            Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c
> {code}
> Pig always spawns a sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
> job_1426804645147_1270	1	1	8	8	8	8	4	4	4	4	b	SAMPLER
> job_1426804645147_1271	1	1	10	10	10	10	4	4	4	4	b	ORDER_BY,COMBINER
> job_1426804645147_1272	1	1	2	2	2	2	4	4	4	4	b		hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is that when reading many files, the first sampler job can take a long time to finish.
> The questions are:
> 1. Is the sampler job required to implement "order by"?
> 2. If not, is there any way to disable RandomSampleLoader manually?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)