You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/05/18 08:55:13 UTC

[jira] [Created] (PIG-4899) The number of records of input file is calculated wrongly in spark mode in multiquery case

liyunzhang_intel created PIG-4899:
-------------------------------------

             Summary:  The number of records of input file is calculated wrongly in spark mode in multiquery case
                 Key: PIG-4899
                 URL: https://issues.apache.org/jira/browse/PIG-4899
             Project: Pig
          Issue Type: Sub-task
            Reporter: liyunzhang_intel
            Assignee: liyunzhang_intel


sparkCounter to calucate the records of input file(LoadConverter#ToTupleFunction#apply) will be executed multiple times in multiquery case. This will cause the input records number is calculated wrongly. for example:
{code}
#--------------------------------------------------
# Spark Plan                                  
#--------------------------------------------------

Spark node scope-534
Split - scope-548
|   |
|   Store(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-538
|   |
|   |---C: Filter[bag] - scope-495
|       |   |
|       |   Less Than or Equal[boolean] - scope-498
|       |   |
|       |   |---Project[int][1] - scope-496
|       |   |
|       |   |---Constant(5) - scope-497
|   |
|   Store(hdfs://localhost:48350/tmp/temp649016960/tmp804709981:org.apache.pig.impl.io.InterStorage) - scope-546
|   |
|   |---B: Filter[bag] - scope-507
|       |   |
|       |   Equal To[boolean] - scope-510
|       |   |
|       |   |---Project[int][0] - scope-508
|       |   |
|       |   |---Constant(3) - scope-509
|
|---A: New For Each(false,false,false)[bag] - scope-491
    |   |
    |   Cast[int] - scope-483
    |   |
    |   |---Project[bytearray][0] - scope-482
    |   |
    |   Cast[int] - scope-486
    |   |
    |   |---Project[bytearray][1] - scope-485
    |   |
    |   Cast[int] - scope-489
    |   |
    |   |---Project[bytearray][2] - scope-488
    |
    |---A: Load(hdfs://localhost:48350/user/root/input:org.apache.pig.builtin.PigStorage) - scope-481--------

Spark node scope-540
C: Store(hdfs://localhost:48350/user/root/output:org.apache.pig.builtin.PigStorage) - scope-502
|
|---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-539--------

Spark node scope-542
D: Store(hdfs://localhost:48350/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-533
|
|---D: FRJoin[tuple] - scope-525
    |   |
    |   Project[int][0] - scope-522
    |   |
    |   Project[int][0] - scope-523
    |   |
    |   Project[int][0] - scope-524
    |
    |---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-541--------

Spark node scope-545
Store(hdfs://localhost:48350/tmp/temp649016960/tmp-2036144538:org.apache.pig.impl.io.InterStorage) - scope-547
|
|---A1: New For Each(false,false,false)[bag] - scope-521
    |   |
    |   Cast[int] - scope-513
    |   |
    |   |---Project[bytearray][0] - scope-512
    |   |
    |   Cast[int] - scope-516
    |   |
    |   |---Project[bytearray][1] - scope-515
    |   |
    |   Cast[int] - scope-519
    |   |
    |   |---Project[bytearray][2] - scope-518
    |
    |---A1: Load(hdfs://localhost:48350/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-511-------
{code}

PhysicalOperator (LoadA) will be executed in LoadConverter#ToTupleFunction#apply for more than the correct times because this is a multi-query case. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)