
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/04/10 12:22:16 UTC

[GitHub] [incubator-doris] kangpinghuang opened a new issue #3295: spark data preparation process

kangpinghuang opened a new issue #3295: spark data preparation process
URL: https://github.com/apache/incubator-doris/issues/3295
 
 
   To solve #2855, we intend to do ETL using a Spark cluster.
   
   PR #3010 has resolved the Spark job submission.
   Issue #2940 has resolved the global dict build process in Spark load.
   This issue tracks the Spark DPP (data preparation process) job, which will accomplish the following tasks:
   
   1. read from the data source and do the ETL job
   Several things have to be done in this step, including:
   
   - schema check
   
   - type cast
   
   - data validation
   
   - null value/default value
   
   - strict mode support
   
   - UDF support
   
   2. repartition and bucket the data according to the Doris data model
   3. build rollups, aggregate, and sort
   4. write the data out as Parquet (phase 1) or Doris segment files (phase 2)
   5. write the DPP job statistics for FE to parse
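
The per-row checks listed under step 1 could be sketched as follows. This is plain Python rather than Spark so that it runs standalone. The `Column` layout, the `etl_row` helper, and the rule that strict mode drops a row whose value fails to cast (while non-strict mode turns it into NULL) are assumptions made for illustration, not the actual DPP design.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Column:
    # Hypothetical column spec, not a Doris structure.
    name: str
    cast: Callable[[str], Any]      # type cast applied to the raw string
    nullable: bool = True
    default: Optional[Any] = None   # used when the source value is missing


def etl_row(raw: dict, schema: list, strict: bool) -> Optional[dict]:
    """Return the converted row, or None if the row must be filtered out."""
    # Schema check: reject rows carrying columns the table does not know.
    known = {c.name for c in schema}
    if not set(raw) <= known:
        return None
    out = {}
    for col in schema:
        value = raw.get(col.name)
        if value is None:
            # Null/default handling for missing source values.
            value = col.default
        else:
            try:
                value = col.cast(value)
            except (TypeError, ValueError):
                if strict:
                    return None  # strict mode: a failed cast drops the row
                value = None     # non-strict: a failed cast becomes NULL
        # Data validation: NULL is only allowed in nullable columns.
        if value is None and not col.nullable:
            return None
        out[col.name] = value
    return out


schema = [
    Column("id", int, nullable=False),
    Column("city", str, default="unknown"),
    Column("pv", int),
]
```

For example, `etl_row({"id": "1", "pv": "x"}, schema, strict=True)` filters the row out, while the same call with `strict=False` keeps it with `pv` set to `None`.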
   
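
Bucketing, aggregation, sorting, and the statistics output (steps 2, 3 and 5) might look roughly like this, again in plain Python as a stand-in for the Spark job. The CRC32 bucket hash, SUM as the only aggregate, and the statistics field names are placeholders chosen for this sketch. The real job would have to match the hash and aggregate types of the Doris table, and the statistics format that FE expects is not defined in this issue.

```python
import json
import zlib
from collections import defaultdict


def bucket_of(row, dist_cols, num_buckets):
    # Placeholder hash: CRC32 over the distribution column values.
    key = "\x01".join(str(row[c]) for c in dist_cols)
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def build_buckets(rows, key_cols, value_col, dist_cols, num_buckets):
    """Group rows by bucket, SUM-aggregate on the key columns, sort by key."""
    agg = defaultdict(dict)  # bucket id -> {key tuple: summed value}
    for row in rows:
        b = bucket_of(row, dist_cols, num_buckets)
        key = tuple(row[c] for c in key_cols)
        agg[b][key] = agg[b].get(key, 0) + row[value_col]
    # Each bucket's output is sorted on the key columns.
    return {b: sorted(kv.items()) for b, kv in agg.items()}


def dpp_stats(total_rows, buckets):
    # Hypothetical statistics layout, serialized as JSON for FE to parse.
    return json.dumps({
        "scanned_rows": total_rows,
        "output_rows": sum(len(v) for v in buckets.values()),
        "bucket_count": len(buckets),  # counts non-empty buckets only
    }, sort_keys=True)


rows = [
    {"id": 2, "city": "bj", "pv": 1},
    {"id": 1, "city": "sh", "pv": 5},
    {"id": 2, "city": "bj", "pv": 4},
]
buckets = build_buckets(rows, ["id", "city"], "pv", ["id"], num_buckets=1)
```

In this sketch a rollup is just another call to `build_buckets` with a subset of the key columns, for example `["city"]`.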
   
