Posted to user@spark.apache.org by wuyangjack <v-...@microsoft.com> on 2015/10/20 17:21:24 UTC

Hive custom transform scripts in Spark?

How can we reuse Hive custom transform scripts written in Python or C++?

These scripts process data from stdin and print results to stdout; we want
to run them in Spark. They use Hive's TRANSFORM syntax:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Example in Hive:
SELECT TRANSFORM(stuff)
USING 'script.exe'
AS thing1, thing2
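
For context, a TRANSFORM script is just a stdin/stdout filter: Hive (and
Spark SQL) write each input row to the script as one tab-separated line and
parse whatever it prints back the same way. A minimal sketch in Python,
using the illustrative column names from the query above:

#!/usr/bin/env python
# Minimal TRANSFORM-style filter: rows arrive on stdin as tab-separated
# lines; whatever is printed to stdout is parsed back into output columns.
import sys

for line in sys.stdin:
    stuff = line.rstrip("\n")           # the single TRANSFORM(stuff) column
    thing1 = stuff.upper()              # first output column
    thing2 = str(len(stuff))            # second output column
    print("\t".join([thing1, thing2]))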



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-custom-transform-scripts-in-Spark-tp25142.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Hive custom transform scripts in Spark?

Posted by Michael Armbrust <mi...@databricks.com>.
Yeah, I don't think this feature was designed to work on systems that don't
have bash.  You could open a JIRA.
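
In the meantime, a possible workaround (an untested sketch; the table,
column, and executable names are taken from the query quoted below) is to
skip Hive's TRANSFORM and pipe an RDD through the script yourself.
RDD.pipe launches the command directly instead of wrapping it in
/bin/bash, so it may work on Windows workers, provided the executable is
already present on them (e.g. shipped with spark-submit --files):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="transform-workaround")
hc = HiveContext(sc)

df = hc.sql("SELECT dc, attribute, key, time, value FROM SourceTable")
# Serialize each row into the tab-separated form a transform script expects.
lines = df.rdd.map(lambda row: "\t".join("" if c is None else str(c) for c in row))
# pipe() starts one process per partition and streams lines through its
# stdin/stdout, without involving /bin/bash.
transformed = lines.pipe("NSSGraphHelper.exe")
print(transformed.take(5))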

On Tue, Oct 20, 2015 at 10:36 AM, Yang Wu (Tata Consultancy Services) <
v-wuyang@microsoft.com> wrote:

> Yes.
>
> We are trying to run a custom script written in C# using TRANSFORM, but
> cannot get it to work.
>
> The query and error are below. Any suggestions? Thank you!
>
>
>
> Spark version: 1.3
>
> Here is how we add and invoke the script:
>
>
>
> scala> hiveContext.sql("""ADD FILE wasb://…/NSSGraphHelper.exe""")
>
>                 …
>
> scala> hiveContext.sql("""SELECT TRANSFORM (dc, attribute, key, time,
> value) USING 'NSSGraphHelper.exe' FROM SourceTable""").collect()
>
>
>
> The query throws an exception that it cannot find the file specified:
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 16.0 (TID 1273, workernode1.nsssparkcluster.g10.internal.cloudapp.net):
> java.io.IOException: Cannot run program "/bin/bash": CreateProcess
> error=2, The system cannot find the file specified
>
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
>
>         at
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anonfun$1.apply(ScriptTransformation.scala:61)
>
>         at
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anonfun$1.apply(ScriptTransformation.scala:58)
>
>         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>
>         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>
>         at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>
>         at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>
>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.io.IOException: CreateProcess error=2, The system cannot
> find the file specified
>
>         at java.lang.ProcessImpl.create(Native Method)
>
>         at java.lang.ProcessImpl.<init>(ProcessImpl.java:385)
>
>         at java.lang.ProcessImpl.start(ProcessImpl.java:136)
>
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
>
>         ... 16 more
>
>
>
> Driver stacktrace:
>
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>
>         at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
>         at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
>         at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>
>         at scala.Option.foreach(Option.scala:236)
>
>         at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
>
>
> From: Michael Armbrust [mailto:michael@databricks.com]
> Sent: Tuesday, October 20, 2015 10:21 AM
> To: Yang Wu (Tata Consultancy Services) <v-...@microsoft.com>
> Cc: user <us...@spark.apache.org>
> Subject: Re: Hive custom transform scripts in Spark?
>
>
>
> We support TRANSFORM.  Are you having a problem using it?
>
>
>
> On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack <v-...@microsoft.com>
> wrote:
>
> How can we reuse Hive custom transform scripts written in Python or C++?
>
> These scripts process data from stdin and print results to stdout; we want
> to run them in Spark. They use Hive's TRANSFORM syntax:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
>
> Example in Hive:
> SELECT TRANSFORM(stuff)
> USING 'script.exe'
> AS thing1, thing2
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-custom-transform-scripts-in-Spark-tp25142.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
>

RE: Hive custom transform scripts in Spark?

Posted by "Yang Wu (Tata Consultancy Services)" <v-...@microsoft.com>.
Yes.
We are trying to run a custom script written in C# using TRANSFORM, but cannot get it to work.
The query and error are below. Any suggestions? Thank you!

Spark version: 1.3
Here is how we add and invoke the script:

scala> hiveContext.sql("""ADD FILE wasb://…/NSSGraphHelper.exe""")
                …
scala> hiveContext.sql("""SELECT TRANSFORM (dc, attribute, key, time, value) USING 'NSSGraphHelper.exe' FROM SourceTable""").collect()

The query throws an exception that it cannot find the file specified:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 16.0 (TID 1273, workernode1.nsssparkcluster.g10.internal.cloudapp.net): java.io.IOException:
Cannot run program "/bin/bash": CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
        at org.apache.spark.sql.hive.execution.ScriptTransformation$$anonfun$1.apply(ScriptTransformation.scala:61)
        at org.apache.spark.sql.hive.execution.ScriptTransformation$$anonfun$1.apply(ScriptTransformation.scala:58)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
        at java.lang.ProcessImpl.create(Native Method)
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:385)
        at java.lang.ProcessImpl.start(ProcessImpl.java:136)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
        ... 16 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
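
The key line is the Cannot run program "/bin/bash" message: the
ScriptTransformation frames above show that, in this Spark version, the
transform command is launched through bash rather than directly, so on a
Windows worker the job fails before NSSGraphHelper.exe is even looked up.
A minimal repro sketch outside Spark (hypothetical, for illustration only):

import subprocess

# On a Windows host this raises the same "CreateProcess error=2":
# there is no /bin/bash to wrap the script invocation in.
subprocess.Popen(["/bin/bash", "-c", "NSSGraphHelper.exe"]).wait()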

From: Michael Armbrust [mailto:michael@databricks.com]
Sent: Tuesday, October 20, 2015 10:21 AM
To: Yang Wu (Tata Consultancy Services) <v-...@microsoft.com>
Cc: user <us...@spark.apache.org>
Subject: Re: Hive custom transform scripts in Spark?

We support TRANSFORM.  Are you having a problem using it?

On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack <v-...@microsoft.com> wrote:
How can we reuse Hive custom transform scripts written in Python or C++?

These scripts process data from stdin and print results to stdout; we want
to run them in Spark. They use Hive's TRANSFORM syntax:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Example in Hive:
SELECT TRANSFORM(stuff)
USING 'script.exe'
AS thing1, thing2



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-custom-transform-scripts-in-Spark-tp25142.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Hive custom transform scripts in Spark?

Posted by Michael Armbrust <mi...@databricks.com>.
We support TRANSFORM.  Are you having a problem using it?
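
For reference, a minimal end-to-end sketch in PySpark (the script name,
path, and table are illustrative; the script itself can be any
stdin/stdout filter like the ones described in the question below):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="transform-example")
hc = HiveContext(sc)

# Ship the script to the executors, then invoke it via TRANSFORM.
hc.sql("ADD FILE /path/to/script.py")
rows = hc.sql("""
    SELECT TRANSFORM(stuff)
    USING 'python script.py'
    AS thing1, thing2
    FROM some_table
""").collect()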

On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack <v-...@microsoft.com> wrote:

> How can we reuse Hive custom transform scripts written in Python or C++?
>
> These scripts process data from stdin and print results to stdout; we want
> to run them in Spark. They use Hive's TRANSFORM syntax:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
>
> Example in Hive:
> SELECT TRANSFORM(stuff)
> USING 'script.exe'
> AS thing1, thing2
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-custom-transform-scripts-in-Spark-tp25142.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>