Posted to user@kylin.apache.org by ShaoFeng Shi <sh...@apache.org> on 2021/06/03 02:10:56 UTC

Re: MERGE CUBE job always fails

Hi Michael,

Thanks for your information.

Firstly, I want to clarify the path
"hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-d0a4b4b5-44ba-cdf3-c6d0-231483835b24/xxxxx/cuboid".
Although the path contains the "kylin_metadata" prefix, it is not used only
for metadata; it also holds data, especially the cuboid data. We put
"kylin_metadata" in the path because it identifies this Kylin instance, so
if you have another Kylin instance, e.g. "kylin_metadata_qa", its data will
be in a different path.
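
To illustrate the layout, here is a minimal Python sketch of how such a path
appears to be composed. It is inferred only from the paths shown in this
thread; the function and parameter names are hypothetical and are not Kylin
APIs.

```python
def cuboid_path(working_dir: str, metadata_prefix: str, job_id: str, cube: str) -> str:
    """Compose a per-job cuboid directory, following the layout seen in this thread.

    Assumption: <working-dir>/<metadata prefix>/kylin-<job id>/<cube>/cuboid
    """
    base = working_dir.rstrip("/")
    return f"{base}/{metadata_prefix}/kylin-{job_id}/{cube}/cuboid"


# Same job and cube, two instances with different metadata prefixes.
job = "d0a4b4b5-44ba-cdf3-c6d0-231483835b24"
print(cuboid_path("hdfs://xxxxx:8020/kylin", "kylin_metadata", job, "xxxxx"))
print(cuboid_path("hdfs://xxxxx:8020/kylin", "kylin_metadata_qa", job, "xxxxx"))
```

The two printed paths differ only in the instance prefix, which is why two
instances sharing one working directory do not collide.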

Secondly, if you see a "Path not exist" error and have confirmed that the
path really is not there, it may be caused by stopping and starting the EMR
cluster, because EMR HDFS data is lost when the cluster is restarted.
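
One way to confirm this before running a merge is to check each directory
named in the step's -input parameter. The following is only a sketch: the
helper names are hypothetical, and it assumes the "hdfs" command is on the
PATH so that "hdfs dfs -test -e" can be used for the existence check.

```python
import subprocess
from typing import Callable, List


def hdfs_exists(path: str) -> bool:
    """Return True if `hdfs dfs -test -e` exits with 0 for the path."""
    result = subprocess.run(["hdfs", "dfs", "-test", "-e", path])
    return result.returncode == 0


def missing_inputs(input_arg: str,
                   exists: Callable[[str], bool] = hdfs_exists) -> List[str]:
    """Split a comma-separated -input value and return the absent directories."""
    missing = []
    for entry in input_arg.split(","):
        directory = entry.strip()
        if directory.endswith("/*"):
            directory = directory[:-2]  # drop the trailing glob, keep the directory
        if directory and not exists(directory):
            missing.append(directory)
    return missing


if __name__ == "__main__":
    # Dry run with a stand-in existence check, so no cluster is needed.
    base = "hdfs://xxxxx:8020/kylin/kylin_metadata"
    present = {base + "/kylin-a13671fd-ef51-fa05-89d1-033f0c6e3423/xxxxx/cuboid"}
    arg = ",".join([
        base + "/kylin-a13671fd-ef51-fa05-89d1-033f0c6e3423/xxxxx/cuboid/*",
        base + "/kylin-d0a4b4b5-44ba-cdf3-c6d0-231483835b24/xxxxx/cuboid/*",
    ])
    for path in missing_inputs(arg, exists=lambda p: p in present):
        print("missing:", path)
```

On a real cluster you would call missing_inputs(arg) without the stand-in
check; any directory it reports is a segment whose cuboid files are gone.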

Putting all data on S3 avoids that problem, but the build performance might
be slower than with local HDFS. Depending on how much data you have in the
cluster, there are different approaches to optimize it.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org



On Fri, Apr 30, 2021 at 11:49 PM, Michael, Gabe <Ga...@disneystreaming.com> wrote:

> I think I have solved my problem.
>
>
>
> I misunderstood the purpose of the kylin.env.hdfs-working-dir property.
>
> In the documentation on Kylin on EMR (
> http://kylin.apache.org/docs/install/kylin_aws_emr.html) it says:
>
>
>
>                 Kylin’s ‘hdfs-working-dir’ is for putting the intermediate
> data for Cube building, cuboid files and also some metadata files (like
> dictionary and table snapshots which are not good in HBase);
>
>                 so it is best to configure HDFS for this.
>
>                 If using HDFS as Kylin working directory, you just leave
> configurations unchanged as EMR’s default FS is HDFS:
>
>                 kylin.env.hdfs-working-dir=/kylin
>
>                 Before you shutdown/restart the cluster, you must backup
> the “/kylin” data on HDFS to S3 with S3DistCp, or you may lost data and
> couldn’t recover the cluster later.
>
>
>
>                 Use S3 as kylin.env.hdfs-working-dir
>
>
>
>                 If you want to use S3 as storage (assume HBase is also on
> S3), you need configure the following parameters:
>
>
>
>
>                 kylin.env.hdfs-working-dir=s3://yourbucket/kylin
>
>                 kylin.storage.hbase.cluster-fs=s3://yourbucket
>
>                 kylin.source.hive.redistribute-flat-table=false
>
>
>
>                 The intermediate file and the HFile will all be written to
> S3.
>
>
>
> I misunderstood the documentation and assumed that when kylin.metadata.url
> was configured to point to a MySQL database, all Kylin metadata would be
> written to MySQL.
>
> But now I understand that some Kylin metadata is always written to HDFS or
> S3, regardless of whether kylin.metadata.url points at MySQL or HBase.
> Because my kylin.env.hdfs-working-dir was pointing to an HDFS location that
> did not persist across EMR clusters, the cuboid metadata was missing and
> the MERGE CUBE job was failing.
>
>
>
> I changed kylin.env.hdfs-working-dir to point to an S3 location,
> purged my cube, built two segments, and successfully ran a MERGE CUBE job.
>
>
>
> Thank you,
>
>
>
> Gabe
>
>
>
> *From: *Michael, Gabe <Ga...@disneystreaming.com>
> *Date: *Thursday, 29 April 2021 at 12:32
> *To: *user@kylin.apache.org <us...@kylin.apache.org>
> *Subject: *MERGE CUBE job always fails
>
> Hello,
>
>
>
> I am running Kylin 3.1.1 on AWS EMR 5.30.1
> (Hadoop 2.8.5, Hive 2.3.6, HBase 1.4.13, ZooKeeper 3.4.14).
>
> HBase is configured to store data on S3, and I use AWS Aurora MySQL
> for Kylin metadata.
>
> Whenever I attempt to run a MERGE CUBE job, the job fails at
> #4 Step Name: Merge Cuboid Data
>
>
>
> Step Parameters:
>
>
>
>                 -conf /usr/local/kylin/conf/kylin_job_conf.xml -cubename
> xxxxx -segmentid f6eadc72-e5e4-bdd2-db1d-24ea19fbf9c4 -input
> hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-a13671fd-ef51-fa05-89d1-033f0c6e3423/xxxxx/cuboid/*,hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-24ef5daf-38e2-56d3-2b0a-792ffb37a0bf/xxxxx/cuboid/*,hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-d0a4b4b5-44ba-cdf3-c6d0-231483835b24/xxxxx/cuboid/*
> -output
> hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-6c86522d-2a6e-2820-1af6-193f7c19b0d0/xxxxx/cuboid/
> -jobname Kylin_Merge_Cuboid_xxxxx_Step
>
> Step Logs:
>
>
>
>                 java.io.IOException: No input paths specified in job
>
>                                at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:245)
>
>                                at
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>
>                                at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:358)
>
>                                at
> org.apache.kylin.engine.mr.common.AbstractHadoopJob.getTotalMapInputMB(AbstractHadoopJob.java:664)
>
>                                at
> org.apache.kylin.storage.hbase.steps.HBaseMROutput2Transition$HBaseMergeMROutputFormat.configureJobOutput(HBaseMROutput2Transition.java:171)
>
>                                at
> org.apache.kylin.engine.mr.steps.MergeCuboidJob.run(MergeCuboidJob.java:88)
>
>                                at
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:155)
>
>                                at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179)
>
>                                at
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>
>                                at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:179)
>
>                                at
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:113)
>
>                                at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>                                at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>                                at java.lang.Thread.run(Thread.java:748)
>
>
>
> Prior to this exception I get warnings in Kylin logs like:
>
>
>
>                 common.AbstractHadoopJob:460 : Path not exist:hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-d0a4b4b5-44ba-cdf3-c6d0-231483835b24/xxxxx/cuboid
>
>
>
> Maybe the MERGE CUBE job is ignoring the kylin.metadata.url
> configuration value?
>
> For example, in #1 Step Name: Merge Cuboid Dictionary I see the
> following parameters:
>
>
>
>                 -conf /usr/local/kylin/conf/kylin_job_conf_cube_merge.xml
> -cubename xxxxx -segmentid f6eadc72-e5e4-bdd2-db1d-24ea19fbf9c4
> -metadataUrl kylin_metadata@hdfs,path=hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-6c86522d-2a6e-2820-1af6-193f7c19b0d0/xxxxx/metadata
> -segmentIds
> d4e7c735-48d2-c257-d011-a99573e3f32d,a5aba327-9441-eb8e-b719-c0642c53632b,688df5de-8123-6963-83bd-330d897a25a6
> -dictOutputPath
> hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-6c86522d-2a6e-2820-1af6-193f7c19b0d0/xxxxx/dict_info
> -statOutputPath
> hdfs://xxxxx:8020/kylin/kylin_metadata/kylin-6c86522d-2a6e-2820-1af6-193f7c19b0d0/xxxxx/fact_distinct_columns/statistics
> -jobname Kylin_Merge_Dictionary_xxxxx_Step
>
>
>
> Note that the -metadataUrl parameter refers to an HDFS location even
> though my kylin.metadata.url refers to a JDBC connection.
>
> Here are my Kylin configuration values (from the web UI, with some
> irrelevant or sensitive values removed):
>
>
>
>                 kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://xxxxx:3306/kylin?createDatabaseIfNotExist=true,username=xxxxx,password=xxxxx,driverClassName=org.mariadb.jdbc.Driver
>
>                 kylin.metadata.sync-retries=3
>
>                 kylin.env.hdfs-working-dir=hdfs://xxxxx:8020/kylin
>
>                 kylin.env=QA
>
>                 kylin.env.zookeeper-base-path=/kylin
>
>                 kylin.server.mode=all
>
>                 kylin.server.cluster-servers=xxxxx:7070
>
>                 kylin.engine.default=2
>
>                 kylin.storage.default=2
>
>                 kylin.cube.migration.enabled=false
>
>                 kylin.source.hive.client=cli
>
>                 kylin.source.hive.beeline-shell=beeline
>
>                 kylin.source.hive.enable-sparksql-for-table-ops=false
>
>                 kylin.source.hive.keep-flat-table=false
>
>                 kylin.source.hive.database-for-flat-table=default
>
>                 kylin.source.hive.redistribute-flat-table=false
>
>                 kylin.source.hive.metadata-type=hcatalog
>
>                 kylin.storage.url=hbase
>
>                 kylin.storage.hbase.table-name-prefix=KYLIN_
>
>                 kylin.storage.hbase.namespace=default
>
>                 kylin.storage.hbase.compression-codec=snappy
>
>                 kylin.storage.hbase.region-cut-gb=5
>
>                 kylin.storage.hbase.hfile-size-gb=2
>
>                 kylin.storage.hbase.min-region-count=1
>
>                 kylin.storage.hbase.max-region-count=500
>
>                 kylin.storage.hbase.owner-tag=whoami@kylin.apache.org
>
>                 kylin.storage.hbase.coprocessor-mem-gb=3
>
>                 kylin.storage.partition.aggr-spill-enabled=true
>
>                 kylin.storage.partition.max-scan-bytes=3221225472
>
>                 kylin.storage.clean-after-delete-operation=false
>
>                 kylin.job.retry=0
>
>                 kylin.job.max-concurrent-jobs=10
>
>                 kylin.job.sampling-percentage=100
>
>
>                 kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
>
>                 kylin.job.scheduler.default=100
>
>                 kylin.engine.mr.yarn-check-interval-seconds=10
>
>                 kylin.engine.mr.reduce-input-mb=500
>
>                 kylin.engine.mr.max-reducer-number=500
>
>                 kylin.engine.mr.mapper-input-rows=1000000
>
>                 kylin.engine.mr.build-dict-in-reducer=true
>
>                 kylin.engine.mr.uhc-reducer-count=3
>
>                 kylin.engine.mr.build-uhc-dict-in-additional-step=false
>
>
>                 kylin.cube.cuboid-scheduler=org.apache.kylin.cube.cuboid.DefaultCuboidScheduler
>
>                 kylin.cube.segment-advisor=org.apache.kylin.cube.CubeSegmentAdvisor
>
>                 kylin.cube.algorithm=layer
>
>                 kylin.cube.algorithm.layer-or-inmem-threshold=7
>
>                 kylin.cube.algorithm.inmem-auto-optimize=true
>
>                 kylin.cube.aggrgroup.max-combination=32768
>
>                 kylin.snapshot.max-mb=300
>
>                 kylin.storage.hbase.cluster-fs=s3://xxxxx
>
>                 kylin.metadata.jdbc.dialect=mysql
>
>                 kylin.metadata.jdbc.json-always-small-cell=true
>
>                 kylin.server.self-discovery-enabled=true
>
>                 kylin.server.cluster-name=kylin_metadata
>
>                 kylin.server.cluster-servers-with-mode=xxxxx:7070:all
>
>
>
> Thank you for your assistance,
>
>
>
> Gabe
>
>
>