Posted to issues@spark.apache.org by "Jacek Laskowski (JIRA)" <ji...@apache.org> on 2016/08/17 20:27:20 UTC

[jira] [Updated] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat

     [ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacek Laskowski updated SPARK-17101:
------------------------------------
    Summary: Provide consistent format identifiers for TextFileFormat and ParquetFileFormat  (was: Provide format identifier for TextFileFormat)

> Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17101
>                 URL: https://issues.apache.org/jira/browse/SPARK-17101
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jacek Laskowski
>            Priority: Trivial
>
> Define a format identifier for the {{text}} file format that is used in the {{Optimized Logical Plan}} section of {{explain}} output. Currently {{TextFileFormat}} prints as its default object identity, e.g. {{TextFileFormat@262e2c8c}}.
> {code}
> scala> spark.read.text("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[value#24] text
> == Analyzed Logical Plan ==
> value: string
> Relation[value#24] text
> == Optimized Logical Plan ==
> InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
> == Physical Plan ==
> InMemoryTableScan [value#24]
>    +- InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
> {code}
> When you {{explain}} the {{csv}} format, you see {{Format: CSV}} instead.
> {code}
> scala> spark.read.csv("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Analyzed Logical Plan ==
> _c0: string, _c1: string, _c2: string, _c3: string
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Optimized Logical Plan ==
> InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>
> == Physical Plan ==
> InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
>    +- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>
> {code}
> A custom format identifier is defined for JSON, too.
> {code}
> scala> spark.read.json("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_corrupt_record#93] json
> == Analyzed Logical Plan ==
> _corrupt_record: string
> Relation[_corrupt_record#93] json
> == Optimized Logical Plan ==
> InMemoryRelation [_corrupt_record#93], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
>    +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> == Physical Plan ==
> InMemoryTableScan [_corrupt_record#93]
>    +- InMemoryRelation [_corrupt_record#93], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
>          +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> {code}
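
A Spark-free sketch of the behaviour described above (the class names and the {{describe}} helper are illustrative, not Spark's actual internals). A class with no {{toString}} override prints via {{java.lang.Object}}'s default, {{getClass.getName + "@" + hexHashCode}}, which is presumably where the {{TextFileFormat@262e2c8c}} token in the text plan comes from; {{CSVFileFormat}} evidently overrides {{toString}} (hence {{Format: CSV}}), and {{TextFileFormat}} and {{ParquetFileFormat}} could do the same:

```scala
trait FileFormat

// No toString override: prints via Object's default,
// e.g. "TextFileFormat@262e2c8c".
class TextFileFormat extends FileFormat

// With an override: prints a stable, human-readable identifier.
class CsvFileFormat extends FileFormat {
  override def toString: String = "CSV"
}

object ExplainDemo {
  // Mimics how the plan string embeds the format object via
  // string interpolation.
  def describe(format: FileFormat): String = s"Format: $format"

  def main(args: Array[String]): Unit = {
    println(describe(new TextFileFormat)) // e.g. Format: TextFileFormat@262e2c8c
    println(describe(new CsvFileFormat))  // Format: CSV
  }
}
```

Overriding {{toString}} (or reusing {{DataSourceRegister.shortName}}) in each {{FileFormat}} implementation would make the {{Format:}} field consistent across all built-in sources.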



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
