Posted to user@spark.apache.org by Uğur Sopaoğlu <us...@gmail.com> on 2018/09/24 12:26:14 UTC

Apache Spark and Airflow connection

I have a Docker-based cluster. In this cluster, I am trying to schedule Spark jobs
with Airflow. Airflow and Spark run separately, in *different
containers*. However, I cannot get a Spark job to run through Airflow.

Below is my Airflow script:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 31)}

dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    dag=dag,
)

I also configured the spark_default connection in the Airflow UI:

[image: Screenshot from 2018-09-24 12-00-46.png]
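
For completeness, the same spark_default connection could also be created
programmatically instead of through the UI. The sketch below is only an
illustration: the host and port are placeholders for the Spark master's
address inside the Docker network, not the actual values from the screenshot.

from airflow import settings
from airflow.models import Connection

# Sketch of the spark_default connection; 'spark://spark-master' and 7077
# are placeholder values for the master's address inside the Docker network.
spark_conn = Connection(
    conn_id='spark_default',
    conn_type='spark',
    host='spark://spark-master',
    port=7077,
)

session = settings.Session()
session.add(spark_conn)
session.commit()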


However, it produces the following error:

[Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

I think Airflow is trying to run the Spark job on its own. How can I configure
it so that the Spark code runs on the Spark master?
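
My understanding is that SparkSubmitOperator builds and runs a spark-submit
command locally and only takes the master URL from the connection, so the
Spark client binaries would also have to be available inside the Airflow
container (or the operator pointed at them explicitly). The following is only
a rough sketch of what I think is needed: /opt/spark is an assumed install
location, and the spark_binary argument may not exist in every Airflow version.

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Sketch only: assumes the Spark client is installed at /opt/spark inside the
# Airflow container, and that this Airflow version's SparkSubmitOperator
# accepts a spark_binary argument. 'dag' is the DAG defined in the script above.
operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',   # the master URL is still read from the connection
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    spark_binary='/opt/spark/bin/spark-submit',
    dag=dag,
)

Even with a remote master URL, the spark-submit client itself starts wherever
Airflow executes the task, which is why the binary has to be reachable there.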

-- 
Uğur Sopaoğlu