Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/03/09 10:59:40 UTC
[jira] [Comment Edited] (SPARK-13766) Inconsistent file extensions and omitted file extensions written by CSV, TEXT and JSON data sources

    [ https://issues.apache.org/jira/browse/SPARK-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186872#comment-15186872 ] 

Hyukjin Kwon edited comment on SPARK-13766 at 3/9/16 9:59 AM:
--------------------------------------------------------------

First, sorry, I only checked this after creating a PR.
I cannot guarantee it, but isn't it usually true that each "part-*" file is in a complete, unbroken format?

I have been testing reading individual "part-*" files, and so far they have all been fine and self-contained; I have not seen any broken ones.

For Parquet and ORC, each file is a complete file in that format. For Text, JSON and CSV, each line holds one record and the records are not broken. I mean, a record would not be split across "part-*" files.
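
As a rough illustration of the line-oriented case, here is a minimal sketch in plain Python. It does not use Spark; it only imitates the layout (one JSON record per line, spread over several "part-*" files) to show why each file can be read on its own. The helper names `write_parts` and `read_part` and the round-robin split are assumptions made up for this example.

```python
import json
import os
import tempfile


def write_parts(records, out_dir, num_parts):
    """Write records round-robin into part-NNNNN files, one JSON object per line.

    Hypothetical helper: it imitates the layout only, not Spark's writer.
    """
    paths = [os.path.join(out_dir, f"part-{i:05d}") for i in range(num_parts)]
    handles = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        for i, rec in enumerate(records):
            # A record is always written whole to exactly one file.
            handles[i % num_parts].write(json.dumps(rec) + "\n")
    finally:
        for h in handles:
            h.close()
    return paths


def read_part(path):
    """Parse a single part file without looking at any other part file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"id": i, "name": f"row{i}"} for i in range(10)]

with tempfile.TemporaryDirectory() as d:
    paths = write_parts(records, d, 3)
    per_part = [read_part(p) for p in paths]

# Every part parsed independently; together they give back the input.
recovered = sorted((r for part in per_part for r in part), key=lambda r: r["id"])
```

Because each line is a whole record, no part file depends on its neighbours. That is the property I mean by "self-contained" here.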

Also, in any case, shouldn't we have a consistent naming rule for those extensions (whether we remove the compression extensions or add some extensions)?
Users can still see the file names, and I think they may find it a bit odd that the file extensions are inconsistent.


was (Author: hyukjin.kwon):
First, sorry, I only checked this after creating a PR.
I cannot guarantee it, but isn't it usually true that each "part-*" file is in an unbroken format?
For example, a record would not be split across "part-*" files.
Also, in any case, shouldn't we have a consistent naming rule for those extensions (whether we remove the compression extensions or add some extensions)?
Users can still see the file names, and I think they may find it a bit odd that the file extensions are inconsistent.

> Inconsistent file extensions and omitted file extensions written by CSV, TEXT and JSON data sources
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13766
>                 URL: https://issues.apache.org/jira/browse/SPARK-13766
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Currently, the output (part-files) from the CSV, TEXT and JSON data sources does not have file extensions such as .csv, .txt and .json (except for compression extensions such as .gz, .deflate and .bz2).
> In addition, it looks like Parquet has extensions (in part-files) such as .gz.parquet or .snappy.parquet according to the compression codec, whereas ORC does not have such extensions and is just .orc.
> So, in a simple view, currently the extensions are set as below:
> {code}
> TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
> Parquet -  [.COMPRESSION_CODEC_NAME].parquet
> ORC - .orc
> {code}
> It would be great if we had consistent naming for them.
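
To make the inconsistency in the description above concrete, here is a small Python sketch that encodes only the three rules listed in the {code} block. It is not Spark code: the function name `current_extension` and the `codec_ext` parameter are hypothetical, and the codec is passed in as the extension text that appears in the file name (for example "gz" or "snappy").

```python
def current_extension(fmt, codec_ext=None):
    """Return the part-file extension per the rules listed in the issue.

    Hypothetical helper for illustration; not part of Spark's API.
    fmt: one of "text", "csv", "json", "parquet", "orc" (case-insensitive).
    codec_ext: compression extension without the dot, or None if uncompressed.
    """
    fmt = fmt.lower()
    codec = f".{codec_ext}" if codec_ext else ""
    if fmt in ("text", "csv", "json"):
        # Only the compression extension, no format extension.
        return codec
    if fmt == "parquet":
        # Compression extension followed by the format extension.
        return codec + ".parquet"
    if fmt == "orc":
        # Format extension only, regardless of the codec.
        return ".orc"
    raise ValueError(f"unknown format: {fmt}")


for fmt in ("text", "csv", "json", "parquet", "orc"):
    print(fmt, repr(current_extension(fmt, "gz")), repr(current_extension(fmt)))
```

With the same codec, the three groups produce three different shapes of extension, and an uncompressed CSV, TEXT or JSON part-file ends up with no extension at all.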


