Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/02/20 04:36:46 UTC
[GitHub] [incubator-doris] wangbo commented on issue #2940: spark etl build global dict doc

wangbo commented on issue #2940: spark etl build global dict doc
URL: https://github.com/apache/incubator-doris/pull/2940#issuecomment-588605086
 
 
   @morningman 
   > As I understand it, building the global index with Spark is part of the Spark load feature, right?
   
   Yes. If a load involves exact-deduplication columns, the flow needs an extra step that builds the global dictionary.
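
   To illustrate what that extra step produces, here is a minimal single-process sketch of a global dictionary: every distinct value of a deduplication column gets a stable integer id, presumably so that exact distinct counts can be computed on integers. This is not the Spark job from the PR; the function names `build_global_dict` and `encode`, the sorted assignment order, and ids starting at 1 are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional


def build_global_dict(values: Iterable[str],
                      existing: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Assign an integer id to every distinct value (hypothetical helper).

    Ids already present in `existing` are kept unchanged, so values encoded
    by earlier loads stay valid; only unseen values receive new ids.
    """
    mapping = dict(existing or {})
    next_id = max(mapping.values(), default=0) + 1
    # Sort the unseen values only to make the assignment deterministic.
    for value in sorted(v for v in set(values) if v not in mapping):
        mapping[value] = next_id
        next_id += 1
    return mapping


def encode(rows: Iterable[str], mapping: Dict[str, int]) -> List[int]:
    """Replace each raw value with its dictionary id."""
    return [mapping[value] for value in rows]


first_load = ["u3", "u1", "u3", "u2"]
global_dict = build_global_dict(first_load)

# A later load extends the dictionary without changing existing ids.
second_load = ["u2", "u9"]
global_dict_v2 = build_global_dict(second_load, existing=global_dict)

encoded = encode(first_load, global_dict_v2)
distinct_count = len(set(encoded))
```

   The point of the sketch is only that ids must stay stable across loads, which is why the dictionary is extended rather than rebuilt. A real job would compute the distinct values and the id assignment in a distributed way.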
   
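   > The information a Spark load job needs is already very complex to express in SQL: for example, filter conditions, column transformations and mappings, and the various connection and permission details needed to access Spark, Hive, and so on.
   Expressing all of this in SQL is already quite a strain.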
   
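   I have already finished the SQL version of this job and it passes basic tests. What remains is mainly testing, which will probably take a long time and is also the most important part. The tests cover resource usage and performance, and the job must at least match Kylin's current capability. So how things are expressed in SQL is not the main concern.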
   
   > For example, if access requires Kerberos, we have no choice but to represent the contents of the keytab file as base64.
   
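   I don't quite follow this one: what is using Kerberos to access what? In my tests so far, I submit the Spark job to YARN directly with a Hadoop account. As long as the local node that submits the job has a keytab and completes one authentication, that is enough; nothing needs to be uploaded.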
   
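   > All of this information has to stay resident in FE memory and also be serialized into the FE metadata. When there are many jobs, the overhead on both memory and metadata storage is very large.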
   
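   The metadata for building the global dictionary has two parts. One part is Doris metadata, including the schema of the olaptable, which definitely has to be persisted.
   The other part is job information, such as input paths and configuration parameters. Building the global dictionary introduces no additional information beyond that.
   So let me confirm what you mean: the idea is to wrap the job information in external XML to address the potentially high FE memory usage caused by too much job information, right? If so, I think that works.
   As for the implementation, though, using XML would seem to require writing a simple protocol to sync the XML to multiple FEs. If the existing metadata persistence scheme supports reading from disk by identifier, it may be simpler.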
