
Posted to dev@kylin.apache.org by Fábio Teixeira <fa...@gmail.com> on 2019/07/08 16:34:01 UTC

Kylin interacting with AWS EMR

Dear all,

First of all, thank you very much for building and maintaining Apache Kylin; the work you are doing is really awesome.

I had to try it out, so I first set up Apache Kylin inside an AWS EMR cluster, which worked pretty well. Then I wanted to really go crazy and run it outside the AWS EMR cluster.

I have already set up a Kylin cluster using MySQL as the metastore, but I am struggling to make it interact with the EMR cluster.

My issue:
In the first build step of a cube, the job fetches data using Sqoop and should load it into the Hive table, but the step times out because it tries to connect to 127.0.0.1:50010, which obviously is not the AWS EMR cluster. I tried to find where I could change the IP for the DataNode, without success.
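
To see which address the Kylin host can actually reach, a quick TCP check may help. The following is a minimal Python sketch and not part of Kylin: the function name `can_connect` is hypothetical, and 50010 is simply the port from the timeout described above. The address of a real EMR core node would have to be added to the target list by hand.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: only the loopback address from the error is listed here;
    # append the address of an EMR core node to compare the two results.
    targets = [("127.0.0.1", 50010)]
    for host, port in targets:
        state = "reachable" if can_connect(host, port) else "not reachable"
        print(f"{host}:{port} is {state}")
```

If the loopback address is not reachable but a cluster node is, the address handed to the client is the thing to investigate, rather than the network path itself.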

Considering my issue, I checked the code and saw that it is possible to run the jobs through a remote CLI, and I was wondering whether this is the way to go in a production environment.

Would you be so kind as to give me some guidance on the following questions?
- Is setting kylin.job.use-remote-cli=true the configuration one should use when Apache Kylin is not inside the Hadoop cluster?
- If not, could you point me to documentation for this kind of setup (Kylin and Hadoop separated)? I have already looked at https://github.com/apache/kylin/tree/master/examples/test_case_data/sandbox
- Do you have more up-to-date documentation for running Kylin outside the Hadoop cluster?
- Is it recommended to run Kylin outside the Hadoop cluster in a production environment?

Thank you in advance.

I look forward to hearing from you.

Kind regards,
Fábio Teixeira

Re: Kylin interacting with AWS EMR

Posted by Xiaoxiang Yu <hi...@126.com>.

Dear friend,
   I am sorry to hear that you have run into this trouble. I have deployed Kylin into a CDH Hadoop cluster, and I know less about AWS EMR, but I will share what I know.
   First question: how to deploy Kylin outside the Hadoop cluster? As far as I can see, you should deploy Kylin on a router/client node of the Hadoop cluster. A router node is a node that has the Hadoop binaries (such as Hive/HDFS) and configuration files deployed, but runs no DataNode/NodeManager (so it carries no heavy workload). The router/client node gives you full access to the Hive CLI, HBase CLI, and HDFS CLI, which makes it suitable for a Kylin deployment.
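
As a rough illustration of what such a client node is expected to provide, the following Python sketch checks whether the command-line tools mentioned above are on the PATH. The helper names and the exact list of commands are assumptions made for illustration; this is not an official Kylin prerequisite check.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Assumption: these command names stand in for the Hive/HBase/HDFS CLIs
# mentioned above; adjust the list to match the actual distribution.
REQUIRED_CLIS = ("hadoop", "hdfs", "hive", "hbase")


def find_clis(
    names: Iterable[str] = REQUIRED_CLIS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


def missing_clis(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted names of commands that could not be resolved."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_clis()
    for name, path in found.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    missing = missing_clis(found)
    if missing:
        print("Missing client tools:", ", ".join(missing))
```

Finding the binaries does not prove that the configuration files point at the right cluster, so this only covers the first half of the description above.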
   On the other hand, I do not think deploying Kylin outside the Hadoop cluster is suitable, because Kylin needs to upload and download large amounts of data to and from the Hadoop cluster. Deploying Kylin outside the cluster makes the network a bottleneck, which hurts Kylin's performance.
   Another question: the entry "kylin.job.use-remote-cli=true" is meant for Kylin developers, not for Kylin users. If you are interested in it, please see http://kylin.apache.org/development/dev_env.html for details.
   Besides, I have invited you to a Slack channel (https://apache-kylin.slack.com). Some Kylin users have deployed Kylin successfully on EMR, so you may ask them further questions.

-----------------
Best wishes to you!
From: Xiaoxiang Yu
