Posted to issues@spark.apache.org by "Lijia Liu (JIRA)" <ji...@apache.org> on 2018/04/26 08:53:00 UTC
[jira] [Created] (SPARK-24098) ScriptTransformationExec should wait for the
process to exit before the output iterator finishes
Lijia Liu created SPARK-24098:
---------------------------------
Summary: ScriptTransformationExec should wait for the process to exit before the output iterator finishes
Key: SPARK-24098
URL: https://issues.apache.org/jira/browse/SPARK-24098
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0, 2.2.1
Environment: Spark Version: 2.2.1
Hadoop Version: 2.7.1
Reporter: Lijia Liu
In our Spark cluster, some users found that Spark may lose data when they use transform in SQL.
We checked the output files and discovered that some of them were empty.
We then checked the executor's log and found exceptions like the following:
{code:java}
18/04/19 03:33:03 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Thread-ScriptTransformation-Feed,5,main]
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(ScriptTransformationExec.scala:341)
at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(ScriptTransformationExec.scala:317)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:317)
at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:306)
at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:306)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1952)
at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:306)
18/04/19 03:33:03 INFO FileOutputCommitter: Saved output of task 'attempt_20180419033229_0049_m_000001_0' to viewfs://hadoop-meituan/ghnn07/warehouse/mart_waimai_crm.db/.hive-staging_hive_2018-04-19_03-18-12_389_7869324671857417311-1/-ext-10000/_temporary/0/task_20180419033229_0049_m_000001
18/04/19 03:33:03 INFO SparkHadoopMapRedUtil: attempt_20180419033229_0049_m_000001_0: Committed
18/04/19 03:33:03 INFO Executor: Finished task 1.0 in stage 49.0 (TID 9843). 17342 bytes result sent to driver{code}
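For context, the Broken pipe above is what the feed thread sees when the child script dies while rows are still being written to its stdin. The following is a minimal, illustrative Python demonstration (not Spark code; all names are made up for the sketch):

```python
import subprocess
import sys

# A child that reads one line and then dies, like a crashing test.py.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdin.readline(); sys.exit(1)"],
    stdin=subprocess.PIPE)

child.stdin.write(b"first row\n")
child.stdin.flush()
child.wait()                      # the child has now exited with code 1

feed_failed = False
try:
    # Feeding more rows into a dead child fails with EPIPE, which Python
    # surfaces as BrokenPipeError -- the analogue of the IOException above.
    child.stdin.write(b"next row\n")
    child.stdin.flush()
except BrokenPipeError:
    feed_failed = True
print("feed failed: broken pipe" if feed_failed else "feed ok")
```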
The ScriptTransformation-Feed thread failed, but the task finished successfully.
Finally, we analysed the class ScriptTransformationExec and found the following:
when the feed thread has not set its _exception variable and the child process has not completely exited, the output iterator's hasNext function returns false, so the failure is never surfaced.
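The race can be modelled outside Spark. Below is a minimal Python sketch (not Spark's Scala code; the class and variable names are illustrative) of an output iterator that treats EOF on the child's stdout as a normal end of data, without waiting for the process or checking its exit code, which is the buggy behaviour described above:

```python
import subprocess
import sys

class OutputIterator:
    """Simplified model of the buggy output iterator."""
    def __init__(self, proc):
        self.proc = proc

    def __iter__(self):
        return self

    def __next__(self):
        line = self.proc.stdout.readline()
        if not line:
            # EOF: the buggy version stops here even if the script crashed,
            # because it never waits for the process or checks _exception.
            raise StopIteration
        return line.strip()

# A child that fails immediately, like test.py in the reproduction below.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    stdout=subprocess.PIPE)
rows = list(OutputIterator(proc))
proc.wait()
print(rows)  # empty -- the child's failure is silently swallowed
```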
Reproducing the bug:
1. Add Thread.sleep(1000 * 600) before the assignment to _exception.
2. Create a Python script which throws an exception, like the following:
test.py
{code:java}
import sys
for line in sys.stdin:
    raise Exception('error')
    print line
{code}
3. Use the script created in step 2 in a transform query:
{code:java}
spark-sql>add files /path_to/test.py;
spark-sql>select transform(id) using 'python test.py' as id from city;
{code}
The result is that the Spark job ends successfully, even though the script failed and the output data is lost.
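A sketch of the direction the summary suggests: when the output iterator reaches EOF, wait for the child process and fail the task on a nonzero exit code. This is a hedged Python model with assumed names, not the actual Spark patch:

```python
import subprocess
import sys

def read_all_checked(proc):
    """Consume the child's stdout, then wait for it and check the exit code."""
    rows = []
    for line in proc.stdout:
        rows.append(line.strip())
    returncode = proc.wait()   # wait for the process to exit before finishing
    if returncode != 0:
        raise RuntimeError("subprocess exited with code %d" % returncode)
    return rows

# Same failing child as before: it exits with code 1.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    stdout=subprocess.PIPE)

failed = False
try:
    read_all_checked(proc)
except RuntimeError as e:
    failed = True
    print("task fails as it should:", e)
```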
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)