
Posted to issues@spark.apache.org by "qian wang (Jira)" <ji...@apache.org> on 2021/04/25 08:05:00 UTC

[jira] [Created] (SPARK-35216) a general auto merge output files feature for datasource api

qian wang created SPARK-35216:
---------------------------------

             Summary: a general auto merge output files feature for datasource api
                 Key: SPARK-35216
                 URL: https://issues.apache.org/jira/browse/SPARK-35216
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.0.2
            Reporter: qian wang


In most cases, users write data to a Hive table or an HDFS directory with Spark SQL. Since the release of Spark 3.0, the project has discouraged using the Hive module to read and write Hive tables, and prefers switching from the Hive strategy rule to the datasource API, so that I/O operations are centralized in one module.

A general ability to automatically merge output files in the datasource API would therefore resolve the small-files problem that many users face in production. It can be bound tightly to the datasource write framework, so that the merge step is transparent to users, and it can handle all kinds of writes, such as writing to an HDFS directory, a non-partitioned Hive table, or a dynamically partitioned Hive table.
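
To make the idea concrete, here is a minimal sketch of the planning step such a merge could perform after a write finishes. It is not the implementation referred to in this issue and it does not use any Spark API; the function name plan_merges, the (directory, file_name, size) input shape, and the 128 MB target size are all assumptions chosen for illustration. Grouping by output directory is what lets one routine cover a plain HDFS directory, a non-partitioned table, and the per-partition directories of a dynamic-partition write.

```python
from collections import defaultdict

# Assumed target size for merged files (roughly one HDFS block); illustrative only.
TARGET_BYTES = 128 * 1024 * 1024


def plan_merges(files, target_bytes=TARGET_BYTES):
    """Group small files per output directory into merge batches.

    files: iterable of (directory, file_name, size_in_bytes).
    Returns {directory: [[file_name, ...], ...]}; each inner list is one
    batch that would be rewritten as a single file. Files at or above the
    target size, and batches holding only one file, are left untouched.
    """
    small_by_dir = defaultdict(list)
    for directory, name, size in files:
        if size < target_bytes:
            small_by_dir[directory].append((name, size))

    plan = {}
    for directory, small in small_by_dir.items():
        batches, current, current_size = [], [], 0
        # Largest first, so each batch is closed as soon as it would overflow.
        for name, size in sorted(small, key=lambda item: item[1], reverse=True):
            if current and current_size + size > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            batches.append(current)
        # Rewriting a single file on its own gains nothing, so skip such batches.
        merged = [batch for batch in batches if len(batch) > 1]
        if merged:
            plan[directory] = merged
    return plan
```

A real implementation would still have to list the written files, rewrite each batch, and commit the result atomically; the sketch covers only the decision of which files to combine.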

This is my own implementation of the functionality, and it has been stable in my company's production environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org