You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Nandor Kollar (JIRA)" <ji...@apache.org> on 2017/02/15 09:11:42 UTC
[jira] [Commented] (PIG-4899) The number of records of input file
is calculated wrongly in spark mode in multiquery case
[ https://issues.apache.org/jira/browse/PIG-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867502#comment-15867502 ]
Nandor Kollar commented on PIG-4899:
------------------------------------
LGTM, [~kellyzly] I think Adam's patch will fix the NPE you noticed in PIG-5128
> The number of records of input file is calculated wrongly in spark mode in multiquery case
> -------------------------------------------------------------------------------------------
>
> Key: PIG-4899
> URL: https://issues.apache.org/jira/browse/PIG-4899
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-4899.2.patch, PIG-4899.3IncrFrom2.patch, PIG-4899.patch
>
>
> sparkCounter to calucate the records of input file(LoadConverter#ToTupleFunction#apply) will be executed multiple times in multiquery case. This will cause the input records number is calculated wrongly. for example:
> {code}
> #--------------------------------------------------
> # Spark Plan
> #--------------------------------------------------
> Spark node scope-534
> Split - scope-548
> | |
> | Store(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-538
> | |
> | |---C: Filter[bag] - scope-495
> | | |
> | | Less Than or Equal[boolean] - scope-498
> | | |
> | | |---Project[int][1] - scope-496
> | | |
> | | |---Constant(5) - scope-497
> | |
> | Store(hdfs://localhost:48350/tmp/temp649016960/tmp804709981:org.apache.pig.impl.io.InterStorage) - scope-546
> | |
> | |---B: Filter[bag] - scope-507
> | | |
> | | Equal To[boolean] - scope-510
> | | |
> | | |---Project[int][0] - scope-508
> | | |
> | | |---Constant(3) - scope-509
> |
> |---A: New For Each(false,false,false)[bag] - scope-491
> | |
> | Cast[int] - scope-483
> | |
> | |---Project[bytearray][0] - scope-482
> | |
> | Cast[int] - scope-486
> | |
> | |---Project[bytearray][1] - scope-485
> | |
> | Cast[int] - scope-489
> | |
> | |---Project[bytearray][2] - scope-488
> |
> |---A: Load(hdfs://localhost:48350/user/root/input:org.apache.pig.builtin.PigStorage) - scope-481--------
> Spark node scope-540
> C: Store(hdfs://localhost:48350/user/root/output:org.apache.pig.builtin.PigStorage) - scope-502
> |
> |---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-539--------
> Spark node scope-542
> D: Store(hdfs://localhost:48350/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-533
> |
> |---D: FRJoin[tuple] - scope-525
> | |
> | Project[int][0] - scope-522
> | |
> | Project[int][0] - scope-523
> | |
> | Project[int][0] - scope-524
> |
> |---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-541--------
> Spark node scope-545
> Store(hdfs://localhost:48350/tmp/temp649016960/tmp-2036144538:org.apache.pig.impl.io.InterStorage) - scope-547
> |
> |---A1: New For Each(false,false,false)[bag] - scope-521
> | |
> | Cast[int] - scope-513
> | |
> | |---Project[bytearray][0] - scope-512
> | |
> | Cast[int] - scope-516
> | |
> | |---Project[bytearray][1] - scope-515
> | |
> | Cast[int] - scope-519
> | |
> | |---Project[bytearray][2] - scope-518
> |
> |---A1: Load(hdfs://localhost:48350/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-511-------
> {code}
> PhysicalOperator (LoadA) will be executed in LoadConverter#ToTupleFunction#apply for more than the correct times because this is a multi-query case.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)