You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2015/07/01 01:30:05 UTC
[jira] [Updated] (TEZ-2587) Tez should provide attemptId (or some other ways of linking multiple threads for the same task)

     [ https://issues.apache.org/jira/browse/TEZ-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated TEZ-2587:
----------------------------------
    Description: 
There are at least 2 threads calling Hive code for every task; thread #1
{noformat}
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:303)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:152)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.<init>(MRReaderMapred.java:73)
	at org.apache.tez.mapreduce.input.MultiMRInput.initFromEvent(MultiMRInput.java:177)
	at org.apache.tez.mapreduce.input.MultiMRInput.handleEvents(MultiMRInput.java:146)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:650)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:103)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:720)
	at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Thread #2
{noformat}
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Right now, there's no way for these threads to communicate with each other or share data.
While processor callee has access to some context objects and stuff, the input thread doesn't have access to anything.

Hive used globals to work around that, however this is both ugly, and no longer works if multiple tasks run in the same process.

There should be some way for the threads to talk... either IO thread should have access to ProcessorContext somehow, or maybe both should have attemptId added to the supplied conf. Perhaps it's possible to add a global method to get ProcessorContext by attemptId then, or if not we can arrange our own ugly globals by attemptId.

  was:
There are at least 2 threads calling Hive code for every task; thread #1
{noformat}
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:303)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:152)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.<init>(MRReaderMapred.java:73)
	at org.apache.tez.mapreduce.input.MultiMRInput.initFromEvent(MultiMRInput.java:177)
	at org.apache.tez.mapreduce.input.MultiMRInput.handleEvents(MultiMRInput.java:146)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:650)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:103)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:720)
	at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Thread #2
{noformat}
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Right now, there's no way for these threads to communicate with each other or share data.
While processor caller has access to some context objects and stuff, the input thread doesn't have access to anything.

Hive used globals to work around that, however this is both ugly, and no longer works if multiple tasks run in the same process.

There should be some way for the threads to talk... either IO thread should have access to ProcessorContext somehow, or maybe both should have attemptId added to the supplied conf. Perhaps it's possible to add a global method to get ProcessorContext by attemptId then, or if not we can arrange our own ugly globals by attemptId.


> Tez should provide attemptId (or some other ways of linking multiple threads for the same task)
> -----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2587
>                 URL: https://issues.apache.org/jira/browse/TEZ-2587
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> There are at least 2 threads calling Hive code for every task; thread #1
> {noformat}
> 	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:303)
> 	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
> 	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
> 	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
> 	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:152)
> 	at org.apache.tez.mapreduce.lib.MRReaderMapred.<init>(MRReaderMapred.java:73)
> 	at org.apache.tez.mapreduce.input.MultiMRInput.initFromEvent(MultiMRInput.java:177)
> 	at org.apache.tez.mapreduce.input.MultiMRInput.handleEvents(MultiMRInput.java:146)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:650)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:103)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:720)
> 	at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Thread #2
> {noformat}
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Right now, there's no way for these threads to communicate with each other or share data.
> While processor callee has access to some context objects and stuff, the input thread doesn't have access to anything.
> Hive used globals to work around that, however this is both ugly, and no longer works if multiple tasks run in the same process.
> There should be some way for the threads to talk... either IO thread should have access to ProcessorContext somehow, or maybe both should have attemptId added to the supplied conf. Perhaps it's possible to add a global method to get ProcessorContext by attemptId then, or if not we can arrange our own ugly globals by attemptId.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)