Posted to dev@pig.apache.org by "Hao Zhu (JIRA)" <ji...@apache.org> on 2015/03/27 23:32:54 UTC
[jira] [Created] (PIG-4485) Can Pig disable RandomSampleLoader when doing "Order by"
Hao Zhu created PIG-4485:
----------------------------
Summary: Can Pig disable RandomSampleLoader when doing "Order by"
Key: PIG-4485
URL: https://issues.apache.org/jira/browse/PIG-4485
Project: Pig
Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Hao Zhu
Priority: Critical
When reading parquet files with "order by":
{code}
a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
b = order a by col1;
c = limit b 100;
dump c;
{code}
Pig always spawns a sampler job at the beginning:
{code}
Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1426804645147_1270	1	1	8	8	8	8	4	4	4	4	b	SAMPLER
job_1426804645147_1271	1	1	10	10	10	10	4	4	4	4	b	ORDER_BY,COMBINER
job_1426804645147_1272	1	1	2	2	2	2	4	4	4	4	b		hdfs:/tmp/temp-xxx/tmp-xxx,
{code}
The issue is that when the input consists of many files, this first sampler job can take a long time to finish.
The questions are:
1. Is the sampler job required to implement "order by"?
2. If not, is there any way to disable RandomSampleLoader manually?
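For context on question 1, here is a minimal Python sketch of the general technique a sampling pass usually supports in a distributed sort: sample the keys, pick range boundaries from the sample, route each key to its range, then sort each range locally. This is only an illustration of sample-based range partitioning, not Pig's actual implementation; the function names (pick_boundaries, partition_for, range_partitioned_sort) and the parameters are hypothetical.

```python
import bisect
import random


def pick_boundaries(sample, num_partitions):
    """Choose up to num_partitions - 1 split points from a sample of keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the range partition that key falls into."""
    return bisect.bisect_right(boundaries, key)


def range_partitioned_sort(keys, num_partitions, sample_size, seed=0):
    """Sort keys in two passes: sample first, then partition by range and sort."""
    rng = random.Random(seed)
    # Pass 1 (the "sampler" step): look at a small random sample of the keys.
    sample = rng.sample(keys, min(sample_size, len(keys)))
    boundaries = pick_boundaries(sample, num_partitions)
    # Pass 2 (the "order by" step): route every key to its range.
    buckets = [[] for _ in range(num_partitions)]
    for key in keys:
        buckets[partition_for(key, boundaries)].append(key)
    # Each range is sorted on its own; concatenating them in range order
    # gives a globally sorted result.
    return [key for bucket in buckets for key in sorted(bucket)]
```

In this sketch the sample only affects how evenly keys are spread across the ranges. The output is sorted for any choice of boundaries, so what the sampling pass buys is balance across partitions, at the cost of an extra read of the input.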
Thanks.