Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2017/04/12 20:35:41 UTC

[jira] [Updated] (PIG-5219) IndexOutOfBoundsException when loading multiple directories with different schemas using OrcStorage

     [ https://issues.apache.org/jira/browse/PIG-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-5219:
----------------------------
    Fix Version/s: 0.17.0

> IndexOutOfBoundsException when loading multiple directories with different schemas using OrcStorage
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-5219
>                 URL: https://issues.apache.org/jira/browse/PIG-5219
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>         Environment: Pig Version: 0.16.0
> OS: EMR 5.3.1
>            Reporter: Omer Tal
>            Assignee: Daniel Dai
>             Fix For: 0.17.0
>
>
> Scenario:
> # The data set covers two hours of the same day. In hour 00 the ORC files have 4 columns {a,b,c,d}; during hour 02 the schema changes to 5 columns {a,b,c,d,e}, so the hour 02 directory holds files with both schemas.
> # Loading ORC files with the same schema (hour 00):
> {code}
> x = load 's3://orc_files/dt=2017-03-21/hour=00' using OrcStorage();
> dump x;
> {code}
> Result:
> {code}
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> {code}
> # Loading ORC files with different schemas in the same directory:
> {code}
> x = load 's3://orc_files/dt=2017-03-21/hour=02' using OrcStorage();
> dump x;
> {code}
> Result:
> {code}
> (1,2,3,4,5)
> (1,2,3,4,5)
> (1,2,3,4,5)
> (1,2,3,4,5)
> (1,2,3,4,5)
> (1,2,3,4,5)
> (1,2,3,4,5)
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> (1,2,3,4)
> {code}
> # Loading the whole day (both hour 00 and 02):
> {code}
> x = load 's3://orc_files/dt=2017-03-21' using OrcStorage();
> dump x;
> {code}
> Result:
> {code}
> 37332 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - DAG Status: status=FAILED, progress=TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 1 Killed: 0 FailedTaskAttempts: 4, diagnostics=Vertex failed, vertexName=scope-2, vertexId=vertex_1491991474861_0006_1_00, diagnostics=[Task failed, taskId=task_1491991474861_0006_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1491991474861_0006_1_00_000000_0:java.lang.IndexOutOfBoundsException: Index: 4, Size: 4
>         at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>         at java.util.ArrayList.get(ArrayList.java:429)
>         at org.apache.pig.impl.util.hive.HiveUtils.convertHiveToPig(HiveUtils.java:97)
>         at org.apache.pig.builtin.OrcStorage.getNext(OrcStorage.java:381)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>         at org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:119)
>         at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:140)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>         at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>         at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
>         at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
>         at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>         at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>         at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>         at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>         at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
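
The "Index: 4, Size: 4" in the trace above is consistent with a converter that expects a fifth field but is handed a row with only four values. The following is a minimal standalone sketch of that failure mode, not the actual HiveUtils.convertHiveToPig code; the class SchemaMismatchSketch and the methods convertStrict and convertLenient are hypothetical names invented for illustration, and the null-padding variant is only one conceivable way to tolerate the mismatch, not necessarily what the 0.17.0 fix does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Pig.
public class SchemaMismatchSketch {

    // Walks the expected field names and fetches each value by position.
    // Throws IndexOutOfBoundsException when the row has fewer values
    // than the expected schema has fields.
    static List<Object> convertStrict(List<String> expectedFields, List<Object> row) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < expectedFields.size(); i++) {
            out.add(row.get(i));
        }
        return out;
    }

    // Same walk, but pads missing trailing fields with null
    // (an assumed tolerance strategy, not the confirmed fix).
    static List<Object> convertLenient(List<String> expectedFields, List<Object> row) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < expectedFields.size(); i++) {
            out.add(i < row.size() ? row.get(i) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // Schema as seen in hour 02: five columns.
        List<String> fiveColumns = Arrays.asList("a", "b", "c", "d", "e");
        // Row as written in hour 00: four values.
        List<Object> fourValueRow = new ArrayList<Object>(Arrays.asList(1, 2, 3, 4));

        try {
            convertStrict(fiveColumns, fourValueRow);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("strict conversion failed: " + e.getMessage());
        }

        System.out.println("lenient conversion: " + convertLenient(fiveColumns, fourValueRow));
    }
}
```

The strict variant fails on the fifth lookup, mirroring the ArrayList.get frame at the top of the trace. The lenient variant returns a five-element row whose last field is null.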



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)