Posted to user@kylin.apache.org by "Michael, Gabe" <Ga...@disneystreaming.com> on 2021/09/10 18:57:01 UTC

Kylin v4.0.0 GA on EMR 6.3.0 fails to start Sparder due to YARN staging files missing

Hello,

When running Kylin 4.0.0 on AWS EMR 6.3.0, I am able to build a cube successfully.

But when I try to query it, the Sparder application cannot start.

Kylin uploads the Spark staging files to a local directory instead of HDFS, and the Spark job then fails because the YARN containers cannot read files from that directory.

2021-09-10 18:45:47,407 INFO  [Thread-9] yarn.Client:57 : Preparing resources for our AM container
2021-09-10 18:45:47,428 WARN  [Thread-9] yarn.Client:69 : Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2021-09-10 18:45:50,861 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
2021-09-10 18:45:51,487 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
2021-09-10 18:45:51,597 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
2021-09-10 18:45:51,718 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing view acls to: hadoop
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing modify acls to: hadoop
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing view acls groups to:
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing modify acls groups to:
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
2021-09-10 18:45:51,814 INFO  [Thread-9] yarn.Client:57 : Submitting application application_1631282030708_2863 to ResourceManager
2021-09-10 18:45:51,861 INFO  [Thread-9] impl.YarnClientImpl:329 : Submitted application application_1631282030708_2863
2021-09-10 18:45:52,863 INFO  [Thread-9] yarn.Client:57 : Application report for application_1631282030708_2863 (state: FAILED)
2021-09-10 18:45:52,866 INFO  [Thread-9] yarn.Client:57 :
       client token: N/A
       diagnostics: Application application_1631282030708_2863 failed 2 times due to AM Container for appattempt_1631282030708_2863_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2021-09-10 18:45:52.033]File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
java.io.FileNotFoundException: File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
       at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:671)
       at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:992)
       at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:661)
       at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:464)
       at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
       at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
       at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
       at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

For more detailed output, check the application tracking page: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863 Then click on links to logs of each attempt.
. Failing the application.
       ApplicationMaster host: N/A
       ApplicationMaster RPC port: -1
       queue: default
       start time: 1631299551829
       final status: FAILED
       tracking URL: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863
       user: hadoop
2021-09-10 18:45:52,941 INFO  [Thread-9] yarn.Client:57 : Deleted staging directory file:/home/hadoop/.sparkStaging/application_1631282030708_2863
2021-09-10 18:45:52,942 ERROR [Thread-9] cluster.YarnClientSchedulerBackend:73 : The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
2021-09-10 18:45:52,943 ERROR [Thread-9] spark.SparkContext:94 : Error initializing SparkContext.

Here are my kylin.properties with irrelevant/sensitive values removed:

kylin.env.hdfs-working-dir=s3a://XXXXX/qa/kylin/hdfs/
kylin.env=QA
kylin.server.mode=all
kylin.server.cluster-servers=localhost:7070
kylin.engine.default=6
kylin.storage.default=4
kylin.server.external-acl-provider=
kylin.source.hive.database-for-flat-table=default
kylin.web.default-time-filter=1
kylin.storage.clean-after-delete-operation=false
kylin.job.retry=1
kylin.job.max-concurrent-jobs=1
kylin.job.sampling-percentage=100
kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
kylin.job.scheduler.default=2
kylin.spark-conf.auto.prior=true
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=client
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${hdfs.working.dir} -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=job -Dkylin.spark.project=${job.project} -Dkylin.spark.identifier=${job.id} -Dkylin.spark.jobName=${job.stepId} -Duser.timezone=${user.timezone}
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-XX:+CrashOnOutOfMemoryError
kylin.query.auto-sparder-context-enabled-enabled=false
kylin.query.spark-conf.spark.master=yarn
kylin.query.spark-conf.spark.driver.cores=1
kylin.query.spark-conf.spark.driver.memory=4G
kylin.query.spark-conf.spark.driver.memoryOverhead=1G
kylin.query.spark-conf.spark.executor.cores=1
kylin.query.spark-conf.spark.executor.instances=1
kylin.query.spark-conf.spark.executor.memory=4G
kylin.query.spark-conf.spark.executor.memoryOverhead=1G
kylin.query.spark-conf.spark.serializer=org.apache.spark.serializer.JavaSerializer
kylin.query.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.query.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=s3a://dataeng-data-test/qa/kylin/hdfs/ -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}
kylin.source.hive.redistribute-flat-table=false
kylin.metadata.jdbc.dialect=mysql
kylin.metadata.jdbc.json-always-small-cell=true
kylin.job.lock=org.apache.kylin.storage.hbase.util.ZookeeperDistributedJobLock
kylin.web.set-config-enable=true
kylin.job.allow-empty-segment=false
kylin.env.hadoop-conf-dir=/etc/hadoop/conf
kylin.query.lazy-query-enabled=true
kylin.query.cache-signature-enabled=true
kylin.query.segment-cache-enabled=false
kylin.engine.spark-fact-distinct=true
kylin.engine.spark-dimension-dictionary=false
kylin.engine.spark-uhc-dictionary=true
kylin.engine.spark.rdd-partition-cut-mb=10
kylin.engine.spark.min-partition=1
kylin.engine.spark.max-partition=5000
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
kylin.engine.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.engine.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
kylin.query.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.query.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
kylin.server.cluster-name=kylin_metadata
kylin.log.spark-executor-properties-file=/usr/local/kylin/conf/spark-executor-log4j.properties
kylin.metadata.url.identifier=kylin_metadata

Thank you for your assistance,

Gabe

--
Gabe Michael
Principal Data Engineer
Disney Streaming Services

Re: Kylin v4.0.0 GA on EMR 6.3.0 fails to start Sparder due to YARN staging files missing

Posted by "Michael, Gabe" <Ga...@disneystreaming.com>.
Yaqian Zhang, thank you for the suggestion. I configured "kylin.query.spark-conf.spark.yarn.stagingDir=hdfs://my-cluster-hostname:8020/tmp/spark-staging" (after first creating the directory on HDFS with "hdfs dfs -mkdir -p /tmp/spark-staging"), and now the file uploads go to HDFS, the Sparder Spark job runs successfully, and I receive query results!
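
To recap for the archives, the fix was two steps ("my-cluster-hostname" is a placeholder for the NameNode host):

# create the staging directory on HDFS:
hdfs dfs -mkdir -p /tmp/spark-staging

# then add to kylin.properties:
kylin.query.spark-conf.spark.yarn.stagingDir=hdfs://my-cluster-hostname:8020/tmp/spark-staging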

From: Yaqian Zhang <Ya...@126.com>
Date: Monday, September 13, 2021 at 22:43
To: user@kylin.apache.org <us...@kylin.apache.org>
Subject: Re: Kylin v4.0.0 GA on EMR 6.3.0 fails to start Sparder due to YARN staging files missing

Hi Gabe:

You can try configuring 'kylin.query.spark-conf.spark.yarn.stagingDir' in kylin.properties to make this setting take effect in Kylin.

On September 13, 2021, at 9:56 PM, Michael, Gabe <Ga...@disneystreaming.com> wrote:

[quoted earlier messages trimmed; they appear in full elsewhere in this thread]

Re: Kylin v4.0.0 GA on EMR 6.3.0 fails to start Sparder due to YARN staging files missing

Posted by Yaqian Zhang <Ya...@126.com>.
Hi Gabe:

You can try configuring 'kylin.query.spark-conf.spark.yarn.stagingDir' in kylin.properties to make this setting take effect in Kylin.
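
For example, add a line of this form to kylin.properties (the NameNode host and directory below are placeholders; use an HDFS path that exists on your cluster):

kylin.query.spark-conf.spark.yarn.stagingDir=hdfs://<namenode-host>:8020/tmp/spark-staging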

> On September 13, 2021, at 9:56 PM, Michael, Gabe <Ga...@disneystreaming.com> wrote:
>
> [quoted message trimmed; it appears in full as the next message in this thread]


Re: Kylin v4.0.0 GA on EMR 6.3.0 fails to start Sparder due to YARN staging files missing

Posted by "Michael, Gabe" <Ga...@disneystreaming.com>.
Thank you for your reply.

HADOOP_CONF_DIR is set correctly to /usr/local/kylin/hadoop_conf
fs.defaultFS in /usr/local/kylin/hadoop_conf/core-site.xml is set to hdfs://xxxxx:8020 (domain name omitted)
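
For reference, the entry in core-site.xml looks like this (value masked the same way):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://xxxxx:8020</value>
</property>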

I also tested submitting a simple Spark app from the command line with spark-submit, and it succeeds.
According to the log messages, the files are uploaded to HDFS when I submit directly with spark-submit:

21/09/13 13:49:19 INFO Client: Preparing resources for our AM container
21/09/13 13:49:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/09/13 13:49:23 INFO Client: Uploading resource file:/mnt/tmp/spark-7256648b-ffe0-4455-8a80-d56f1a7fd707/__spark_libs__3285017367714177339.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/__spark_libs__3285017367714177339.zip
21/09/13 13:49:25 INFO Client: Uploading resource file:/usr/local/kylin/spark/python/lib/pyspark.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/pyspark.zip
21/09/13 13:49:25 INFO Client: Uploading resource file:/usr/local/kylin/spark/python/lib/py4j-0.10.9-src.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/py4j-0.10.9-src.zip
21/09/13 13:49:25 INFO Client: Uploading resource file:/mnt/tmp/spark-7256648b-ffe0-4455-8a80-d56f1a7fd707/__spark_conf__6717448128964414860.zip -> hdfs://xxxxx/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/__spark_conf__.zip

However, I can reproduce the same problem I see with Kylin by pointing spark.yarn.stagingDir at a local path:

spark-submit --master yarn --conf spark.yarn.stagingDir=file:///home/hadoop --deploy-mode client /home/hadoop/foo.py

It will try to upload to a local destination “file:/home/hadoop/.sparkStaging/application_1631282030708_2945/…” and the application will fail.

I am able to set spark.yarn.stagingDir to an HDFS location in /usr/local/kylin/spark/conf/spark-defaults.conf, and spark-submit succeeds.
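
The entry I added is of this form, matching the staging paths in the spark-submit log above (host masked):

spark.yarn.stagingDir    hdfs://xxxxx:8020/tmp/spark-staging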

However, it seems Kylin ignores the spark.yarn.stagingDir value set in spark-defaults.conf?

If I could set spark.yarn.stagingDir correctly for Kylin, I think it would work.

Thank you for your assistance,

Gabe

From: Yaqian Zhang <Ya...@126.com>
Date: Sunday, September 12, 2021 at 22:45
To: user@kylin.apache.org <us...@kylin.apache.org>
Subject: Re: Kylin v4.0.0 GA on EMR 6.3.0 fails to start Sparder due to YARN staging files missing

Hi:
I noticed this in your kylin.log:

“Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
2021-09-10 18:45:51,487 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
2021-09-10 18:45:51,597 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
2021-09-10 18:45:51,718 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip”

This does not look normal. When a Spark application is submitted, these libs should be uploaded to HDFS or S3, but the paths here show that they were uploaded to a local directory on the node running the driver, so the other nodes cannot find them.

I'm not sure what caused these libs not to be uploaded to the correct path, but you can check whether the 'HADOOP_CONF_DIR' configuration appears on the front page of Kylin, as shown in the following figure:
<image001.png>
If so, you can check whether 'fs.defaultFS' in the core-site.xml under that path points to the correct filesystem.
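
For example, a quick check from the shell (substitute the HADOOP_CONF_DIR path shown in the UI):

grep -A 1 'fs.defaultFS' $HADOOP_CONF_DIR/core-site.xml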

By the way, the 'kylin.query.spark-conf.spark.executor.extraJavaOptions' setting in kylin.properties does not need to be modified manually; Kylin configures those variables automatically at runtime.


在 2021年9月11日,上午2:57,Michael, Gabe <Ga...@disneystreaming.com>> 写道:

Hello,

When running Kylin 4.0.0 on AWS EMR 6.3.0, I am able to successfully build a cube.

But when I try to query it, the Sparder application cannot start.

Kylin attempts to upload some files to a local directory, then the Spark job fails because it cannot read files from that directory.

2021-09-10 18:45:47,407 INFO  [Thread-9] yarn.Client:57 : Preparing resources for our AM container
2021-09-10 18:45:47,428 WARN  [Thread-9] yarn.Client:69 : Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2021-09-10 18:45:50,861 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
2021-09-10 18:45:51,487 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
2021-09-10 18:45:51,597 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
2021-09-10 18:45:51,718 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing view acls to: hadoop
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing modify acls to: hadoop
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing view acls groups to:
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : Changing modify acls groups to:
2021-09-10 18:45:51,780 INFO  [Thread-9] spark.SecurityManager:57 : SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
2021-09-10 18:45:51,814 INFO  [Thread-9] yarn.Client:57 : Submitting application application_1631282030708_2863 to ResourceManager
2021-09-10 18:45:51,861 INFO  [Thread-9] impl.YarnClientImpl:329 : Submitted application application_1631282030708_2863
2021-09-10 18:45:52,863 INFO  [Thread-9] yarn.Client:57 : Application report for application_1631282030708_2863 (state: FAILED)
2021-09-10 18:45:52,866 INFO  [Thread-9] yarn.Client:57 :
       client token: N/A
       diagnostics: Application application_1631282030708_2863 failed 2 times due to AM Container for appattempt_1631282030708_2863_000002 exited with  exitCode: -1000
Failing this attempt.Diagnostics: [2021-09-10 18:45:52.033]File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
java.io.FileNotFoundException: File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
       at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:671)
       at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:992)
       at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:661)
       at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:464)
       at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
       at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
       at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
       at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

For more detailed output, check the application tracking page: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863 Then click on links to logs of each attempt.
. Failing the application.
       ApplicationMaster host: N/A
       ApplicationMaster RPC port: -1
       queue: default
       start time: 1631299551829
       final status: FAILED
       tracking URL: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863
       user: hadoop
2021-09-10 18:45:52,941 INFO  [Thread-9] yarn.Client:57 : Deleted staging directory file:/home/hadoop/.sparkStaging/application_1631282030708_2863
2021-09-10 18:45:52,942 ERROR [Thread-9] cluster.YarnClientSchedulerBackend:73 : The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
2021-09-10 18:45:52,943 ERROR [Thread-9] spark.SparkContext:94 : Error initializing SparkContext.

Here is my kylin.properties file with irrelevant/sensitive values removed:

kylin.env.hdfs-working-dir=s3a://XXXXX/qa/kylin/hdfs/
kylin.env=QA
kylin.server.mode=all
kylin.server.cluster-servers=localhost:7070
kylin.engine.default=6
kylin.storage.default=4
kylin.server.external-acl-provider=
kylin.source.hive.database-for-flat-table=default
kylin.web.default-time-filter=1
kylin.storage.clean-after-delete-operation=false
kylin.job.retry=1
kylin.job.max-concurrent-jobs=1
kylin.job.sampling-percentage=100
kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
kylin.job.scheduler.default=2
kylin.spark-conf.auto.prior=true
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=client
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${hdfs.working.dir} -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=job -Dkylin.spark.project=${job.project} -Dkylin.spark.identifier=${job.id} -Dkylin.spark.jobName=${job.stepId} -Duser.timezone=${user.timezone}
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-XX:+CrashOnOutOfMemoryError
kylin.query.auto-sparder-context-enabled-enabled=false
kylin.query.spark-conf.spark.master=yarn
kylin.query.spark-conf.spark.driver.cores=1
kylin.query.spark-conf.spark.driver.memory=4G
kylin.query.spark-conf.spark.driver.memoryOverhead=1G
kylin.query.spark-conf.spark.executor.cores=1
kylin.query.spark-conf.spark.executor.instances=1
kylin.query.spark-conf.spark.executor.memory=4G
kylin.query.spark-conf.spark.executor.memoryOverhead=1G
kylin.query.spark-conf.spark.serializer=org.apache.spark.serializer.JavaSerializer
kylin.query.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.query.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=s3a://dataeng-data-test/qa/kylin/hdfs/ -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}
kylin.source.hive.redistribute-flat-table=false
kylin.metadata.jdbc.dialect=mysql
kylin.metadata.jdbc.json-always-small-cell=true
kylin.job.lock=org.apache.kylin.storage.hbase.util.ZookeeperDistributedJobLock
kylin.web.set-config-enable=true
kylin.job.allow-empty-segment=false
kylin.env.hadoop-conf-dir=/etc/hadoop/conf
kylin.query.lazy-query-enabled=true
kylin.query.cache-signature-enabled=true
kylin.query.segment-cache-enabled=false
kylin.engine.spark-fact-distinct=true
kylin.engine.spark-dimension-dictionary=false
kylin.engine.spark-uhc-dictionary=true
kylin.engine.spark.rdd-partition-cut-mb=10
kylin.engine.spark.min-partition=1
kylin.engine.spark.max-partition=5000
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
kylin.engine.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.engine.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
kylin.query.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.query.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
kylin.server.cluster-name=kylin_metadata
kylin.log.spark-executor-properties-file=/usr/local/kylin/conf/spark-executor-log4j.properties
kylin.metadata.url.identifier=kylin_metadata

Thank you for your assistance,

Gabe

--
Gabe Michael
Principal Data Engineer
Disney Streaming Services


Re: Kylin v4.0.0 GA on EMR 6.3.0 fail to start Sparder due to YARN staging files missing

Posted by Yaqian Zhang <Ya...@126.com>.
Hi:
I noticed this in your kylin.log:

> 2021-09-10 18:45:50,861 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
> 2021-09-10 18:45:51,487 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
> 2021-09-10 18:45:51,597 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
> 2021-09-10 18:45:51,718 INFO  [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip

This does not look normal. When a Spark application is submitted, these libs need to be uploaded to HDFS or S3, but the paths above show that they were uploaded to a local directory on the driver node, so the other nodes cannot find them.
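A quick way to confirm this on the Kylin node (assuming the standard Hadoop client tools are on the PATH) is to print the default filesystem; if it resolves to file:/// rather than an hdfs:// or s3a:// URI, Spark's staging directory falls back to the driver's local filesystem, which matches the paths in your log:

  # should print an hdfs:// (or s3a://) URI on a healthy cluster
  hdfs getconf -confKey fs.defaultFS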
 
I'm not sure what caused these libs to be uploaded to the wrong path, but you can check whether the environment variable 'HADOOP_CONF_DIR' appears in the system information shown on the front page of the Kylin web UI.

If so, you can check whether 'fs.defaultFS' in core-site.xml under that path is set to the correct filesystem URI.
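For reference, a minimal sketch of what that entry should look like in /etc/hadoop/conf/core-site.xml on an EMR cluster (the host name below is a placeholder, not your actual master node):

  <property>
    <name>fs.defaultFS</name>
    <!-- placeholder host name; use your cluster's master node -->
    <value>hdfs://ip-xx-xx-xx-xx.ec2.internal:8020</value>
  </property>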

By the way, the configuration 'kylin.query.spark-conf.spark.executor.extraJavaOptions' in kylin.properties does not need to be modified manually; Kylin fills in those variables automatically at runtime.
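As a possible workaround (a sketch only, not verified on EMR 6.3.0): Spark also lets you pin the staging directory explicitly via spark.yarn.stagingDir, so it no longer depends on fs.defaultFS resolving correctly, e.g. in kylin.properties:

  # unverified workaround: pin the YARN staging dir to HDFS explicitly
  kylin.query.spark-conf.spark.yarn.stagingDir=hdfs:///user/hadoop
  kylin.engine.spark-conf.spark.yarn.stagingDir=hdfs:///user/hadoop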
