Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/07/15 08:24:00 UTC

[GitHub] [incubator-doris] xy720 opened a new issue #4101: [Proposal]Create a jar package's repository for Spark Load

xy720 opened a new issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101


   **Motivation**
   We recently introduced Spark Load, which currently has to upload many jar packages to the Yarn cluster before each load. These packages include `$DORIS_HOME/lib/palo-fe.jar` (the Dpp runtime dependency) and all jars in the `$SPARK_HOME/jars` folder (the Spark dependencies). Uploading them usually takes 2~3 minutes.
   
   Currently, these jars are uploaded to temporary directories in HDFS. `palo-fe.jar` is uploaded to `{working_dir}/jobs/DB_ID/LABEL/JOB_ID/configs`. The other jars are packaged into a zip file and uploaded to `{stage_dir}/APPLICATION_ID/__spark_lib__.zip`.
   
   In most cases, the jar packages uploaded by two different loads are exactly the same, so we don't have to upload them every time. In addition, the jar packages should be stored in one directory so that we can manage them easily. Moreover, we can pack all jars into a zip file in the compile phase and upload it to a specified remote repository before load.
   
   Therefore, I propose creating a repository in HDFS for all dependencies of Spark Load.
   
   **The repository structure**
   
   ```
   Repository/
   |-lib_{version}.zip
   |     {All spark dependencies}
   |     |-roaringbitmap.jar
   |     |-activation-1.1.1.jar
   |     |-aircompressor-0.10.jar
   |     |-...
   |     {All dpp dependencies}
   |     |-spark-dpp.jar
   |-lib_{version}.zip
   |-lib_{version}.zip
   |-...
   ```
   
   The `Repository/` directory is the parent directory of all zip files. When we submit a Spark Load, FE compares the version of the remote zip file with that of the local zip file, and uploads only when the right version cannot be found remotely.
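
As a rough illustration of the check described above, here is a minimal Python sketch. It is not the FE implementation: the function name `needs_upload` and the idea of passing in a plain list of remote file names are assumptions made for the example, while the `lib_{version}.zip` naming follows the structure shown earlier.

```python
def needs_upload(remote_names, local_version):
    """Return True when no remote zip matches the local version.

    remote_names: file names found under Repository/ (hypothetical input,
    e.g. the result of listing the HDFS directory).
    local_version: version string of the locally built zip.
    """
    expected = f"lib_{local_version}.zip"
    return expected not in set(remote_names)


# Example: one version already exists remotely, the other does not.
remote = ["lib_1.0.0.zip", "lib_1.1.0.zip"]
print(needs_upload(remote, "1.1.0"))  # already present, no upload needed
print(needs_upload(remote, "1.2.0"))  # missing, upload needed
```

The point of the sketch is only that the upload becomes conditional on a cheap name lookup instead of happening on every load.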
   
   Note that `spark-dpp.jar` is built by the spark-dpp sub-module. The difference between `palo-fe.jar` and `spark-dpp.jar` is that `spark-dpp.jar` contains the other third-party libraries that `palo-fe.jar` depends on. See issue #4098 for details about the multi-module layout of FE.
   
   Meanwhile, we can set the `AppResourceHdfsPath` argument of spark-submit to the lib.zip file. Spark will analyze it and find the entry point of the MainClass.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] xy720 commented on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
xy720 commented on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-661630805


   1. I am not going to remove the old versions yet, because it's not a big problem for now.
   2. At present, I plan to save a DppVersion in FeConstant. FE will try to find or create a subfolder named after the DppVersion in the repository before load. Each Doris cluster will have its own repository directory.
   3. FE will soon have a Spark Load submodule. For every cluster we maintain a DppVersion in FeConstant, which represents the version of the Spark Load submodule (just like the meta version). Before FE submits a Spark Load, it first looks for the subfolder named after the DppVersion under the repository and finds each library in that folder. It then compares each library by MD5 to determine whether it needs to be uploaded again.
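
The MD5 comparison in point 3 could look roughly like the following Python sketch. The helper names `md5_of_file` and `libraries_to_upload`, and the dictionaries used to describe local and remote libraries, are hypothetical and only serve the example; how FE actually obtains the remote checksums is not specified here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def libraries_to_upload(local_paths, remote_md5_by_name):
    """Return the names of libraries whose local MD5 differs from the remote one.

    local_paths: dict mapping a library name to its local file path.
    remote_md5_by_name: dict mapping a library name to the MD5 recorded remotely
    (a missing entry means the library was never uploaded).
    """
    pending = []
    for name, path in local_paths.items():
        if remote_md5_by_name.get(name) != md5_of_file(path):
            pending.append(name)
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "spark-dpp.zip")
        with open(local, "wb") as f:
            f.write(b"example bytes")
        # Nothing is recorded remotely yet, so the library must be uploaded.
        print(libraries_to_upload({"spark-dpp": local}, {}))
```

Reading in chunks keeps memory use flat even for a large archive of Spark jars.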
   




[GitHub] [incubator-doris] xy720 edited a comment on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
xy720 edited a comment on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-661630805


   1. I am not going to remove the old versions yet, because it's not a big problem for now.
   2. At present, I plan to save a DppVersion in FeConstant. FE will try to find or create a subfolder named after the DppVersion in the repository before load. Each Doris cluster will have its own repository directory.
   3. FE will soon have a Spark Load submodule. For every cluster we maintain a DppVersion in FeConstant, which represents the version of the Spark Load submodule (just like the meta version). Before FE submits a Spark Load, it first looks for the subfolder named after the DppVersion under its repository and finds each library in that folder. It then compares each library by MD5 to determine whether it needs to be uploaded again.
   
   Please see the update in the following comment:




[GitHub] [incubator-doris] xy720 commented on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
xy720 commented on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-661636359


   **The repository structure** will be like this:
   __spark_repository__/
       |-__archive_1_0_0/
       |        |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp.zip
       |        |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip
       |-__archive_2_2_0/
       |        |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp.zip
       |        |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip
       |-__archive_3_2_0/
       |        |-...
   
   1. An archive is a remote directory that contains the libraries belonging to the same dppVersion.
   2. A library (lib) is a remote library for Spark-2x or Spark-dpp.
   
   Every time FE uploads its zip, it creates a subfolder such as "__archive_1_2_0" in the repository and puts the zip file into it.
   
   The zip file is named like "__lib_md5sum_spark-dpp.zip" and is generated from the jar packages. For example, spark-dpp.zip is generated from spark-dpp.jar, and spark-2x.zip is generated from the jars under $SPARK_HOME/jars.
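
To make the naming scheme concrete, here is a small Python sketch that builds and parses these names. The helper names are hypothetical, and mapping a DppVersion such as `1.0.0` to `__archive_1_0_0` by replacing dots with underscores is an assumption inferred from the folder names above, not something the proposal states.

```python
import re

# Matches names such as __lib_<32 hex chars>_<base name>.zip
LIB_PATTERN = re.compile(r"^__lib_([0-9a-f]{32})_(.+)\.zip$")


def archive_dir_name(dpp_version):
    """Build the archive folder name for a version (assumed dot-to-underscore mapping)."""
    return "__archive_" + dpp_version.replace(".", "_")


def remote_lib_name(md5sum, base_name):
    """Build the remote file name for a library zip."""
    return f"__lib_{md5sum}_{base_name}.zip"


def parse_remote_lib_name(file_name):
    """Split a remote file name into (md5sum, base_name), or return None if it does not match."""
    match = LIB_PATTERN.match(file_name)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(archive_dir_name("1.0.0"))
print(remote_lib_name("990325d2c0d1d5e45bf675e54e44fb16", "spark-dpp"))
print(parse_remote_lib_name("__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip"))
```

Embedding the checksum in the file name means the comparison can be done from a directory listing alone, without downloading the remote file.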
   
   If an upload is needed, FE by default finds the local zip files using the following config:
   spark_dpp_resource_local_path = $DORIS_HOME + "/lib/spark-dpp.zip"
   spark_resource_local_path = "{user_setting}"
   
   If spark_resource_local_path is empty, FE will try to find the zip file at
   $SPARK_HOME/jars/spark-2x.zip
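
The fallback rule above can be sketched in Python as follows. The function names and the way `DORIS_HOME` and `SPARK_HOME` are passed in as plain arguments are assumptions for the example; only the two config names and the default paths come from the description above.

```python
import posixpath


def resolve_dpp_resource_path(doris_home):
    """Default location of the dpp zip: $DORIS_HOME/lib/spark-dpp.zip."""
    return posixpath.join(doris_home, "lib", "spark-dpp.zip")


def resolve_spark_resource_path(spark_resource_local_path, spark_home):
    """Use the user setting when present, otherwise fall back to $SPARK_HOME/jars/spark-2x.zip."""
    if spark_resource_local_path:
        return spark_resource_local_path
    return posixpath.join(spark_home, "jars", "spark-2x.zip")


# An empty setting falls back to the default under SPARK_HOME.
print(resolve_spark_resource_path("", "/opt/spark"))
# A non-empty setting is used as-is.
print(resolve_spark_resource_path("/data/custom/spark-2x.zip", "/opt/spark"))
print(resolve_dpp_resource_path("/opt/doris"))
```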




[GitHub] [incubator-doris] wangbo commented on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-659905944


   Some questions:
   1. Do we need to delete jars of versions that are no longer used?
   2. With multiple Doris clusters, how does each cluster know its version, and do they share the same directory?
   3. How do we handle a gray release of FE, when the FEs run different code?






[GitHub] [incubator-doris] xy720 edited a comment on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
xy720 edited a comment on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-661636359


   **The repository structure** will be like this:
   
   ```
   __spark_repository__/
       |-__archive_1_0_0/
       |        |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp.zip
       |        |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip
       |-__archive_2_2_0/
       |        |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp.zip
       |        |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip
       |-__archive_3_2_0/
       |        |-...
   ```
   
   1. An archive is a remote directory that contains the libraries belonging to the same dppVersion.
   2. A library (lib) is a remote library for Spark-2x or Spark-dpp.
   
   Every time FE uploads its zip, it creates a subfolder such as "__archive_1_2_0" in the repository and puts the zip file into it.
   
   The zip file is named like "__lib_md5sum_spark-dpp.zip" and is generated from the jar packages. For example, spark-dpp.zip is generated from spark-dpp.jar, and spark-2x.zip is generated from the jars under `$SPARK_HOME/jars`.
   
   If an upload is needed, FE by default finds the local zip files using the following config:
   `spark_dpp_resource_local_path = $DORIS_HOME + "/lib/spark-dpp.zip"`
   `spark_resource_local_path = "{user_setting}"`
   
   If the config `spark_resource_local_path` is empty, FE will try to find the zip file at
   `$SPARK_HOME/jars/spark-2x.zip`




[GitHub] [incubator-doris] xy720 edited a comment on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
xy720 edited a comment on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-661630805


   1. I am not going to remove the old versions yet, because it's not a big problem for now.
   2. At present, I plan to save a DppVersion in FeConstant. FE will try to find or create a subfolder named after the DppVersion in the repository before load. Each Doris cluster will have its own repository directory.
   3. FE will soon have a Spark Load submodule. For every cluster we maintain a DppVersion in FeConstant, which represents the version of the Spark Load submodule (just like the meta version). Before FE submits a Spark Load, it first looks for the subfolder named after the DppVersion under its repository and finds each library in that folder. It then compares each library by MD5 to determine whether it needs to be uploaded again.
   
   Please see the following comment:






[GitHub] [incubator-doris] xy720 edited a comment on issue #4101: [Proposal]Create a repository for SparkLoad‘s dependencies

Posted by GitBox <gi...@apache.org>.
xy720 edited a comment on issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101#issuecomment-661636359


   **The repository structure** will be like this:
   
   ```
   __spark_repository__/
       |-__archive_1_0_0/
       |        |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp.jar
       |        |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip
       |-__archive_2_2_0/
       |        |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp.jar
       |        |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip
       |-__archive_3_2_0/
       |        |-...
   ```
   
   1. An archive is a remote directory that contains the libraries belonging to the same dppVersion.
   2. A library (lib) is a remote library for Spark-2x or Spark-dpp.
   
   Every time FE uploads its zip, it creates a subfolder such as "__archive_1_2_0" in the repository and puts the zip file into it.
   
   The zip file is named like "__lib_md5sum_spark-dpp.zip" and is generated from the jar packages. For example, spark-dpp.zip is generated from spark-dpp.jar, and spark-2x.zip is generated from the jars under `$SPARK_HOME/jars`.
   
   If an upload is needed, FE by default finds the local zip files using the following config:
   `spark_dpp_resource_local_path = $DORIS_HOME + "/lib/spark-dpp.zip"`
   `spark_resource_local_path = "{user_setting}"`
   
   If the config `spark_resource_local_path` is empty, FE will try to find the zip file at
   `$SPARK_HOME/jars/spark-2x.zip`



