Posted to dev@pig.apache.org by "Travis Woodruff (JIRA)" <ji...@apache.org> on 2018/01/03 19:26:00 UTC

[jira] [Created] (PIG-5326) Issue with auto parallelism and scalar inputs in Tez

Travis Woodruff created PIG-5326:
------------------------------------

             Summary: Issue with auto parallelism and scalar inputs in Tez
                 Key: PIG-5326
                 URL: https://issues.apache.org/jira/browse/PIG-5326
             Project: Pig
          Issue Type: Bug
          Components: tez
            Reporter: Travis Woodruff


I'm getting a "Scalar has more than one row in the output" error with the following script:

{code}
a = LOAD 't' as (x:chararray);
b = GROUP a BY x PARALLEL 2;
c = GROUP b by group;
d = FOREACH (GROUP a ALL) GENERATE COUNT(a) as count;
e = FOREACH c GENERATE group, d.count;
DUMP e;
{code}

If I add a PARALLEL clause to {{c}}, the error goes away, so the issue seems to be related to auto parallelism.
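For reference, a sketch of the variant that avoids the error. The only change is the explicit PARALLEL clause on {{c}}; the value 2 is an assumption for illustration, chosen to match {{b}}, not necessarily the value I tested with.

{code}
a = LOAD 't' as (x:chararray);
b = GROUP a BY x PARALLEL 2;
-- explicit parallelism on c (value 2 is illustrative), so auto parallelism is not applied to this vertex
c = GROUP b by group PARALLEL 2;
d = FOREACH (GROUP a ALL) GENERATE COUNT(a) as count;
e = FOREACH c GENERATE group, d.count;
DUMP e;
{code}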

I'm not very familiar with Tez, so take this with a grain of salt, but the issue seems to be caused by the following sequence:

# {{PigGraceShuffleVertexManager}} calls {{VertexImpl.reconfigureVertex()}}, which configures the parallelism of the vertex ({{VertexImpl.numTasks}})
# The {{InputSpec}} for the scalar input is created (via {{Edge.getDestinationSpec()}}) with {{physicalInputCount}} equal to the parallelism set above
# The input is created (in {{LogicalIOProcessorRuntimeTask.createInput()}}) based on this {{InputSpec}}.
# The resulting {{UnorderedKVInput}} creates a {{ShuffleManager}} with {{numInputs}} = {{numPhysicalInputs}}.
 
This creates a reader that reads the scalar input {{numPhysicalInputs}} times, which results in the "Scalar has more than one row in the output" error in {{ReadScalarsTez}}.

When parallelism is specified explicitly, {{VertexImpl.reconfigureVertex()}} is never called, and {{numPhysicalInputs}} remains 1 for the scalar input.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)