Posted to issues@spark.apache.org by "Tathagata Das (JIRA)" <ji...@apache.org> on 2016/04/22 02:48:13 UTC

[jira] [Updated] (SPARK-14832) Refactor DataSource to ensure schema is inferred only once when creating a file stream

     [ https://issues.apache.org/jira/browse/SPARK-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-14832:
----------------------------------
    Description: 
When creating a file stream using sqlContext.read.stream(), the existing files are scanned twice to infer the schema:
- Once, when creating the DataSource + StreamingRelation in DataFrameReader.stream()
- Again, when creating the streaming Source from the DataSource, in DataSource.createSource()

Instead, the schema should be inferred only once, when the DataFrame is created, and the streaming source should simply reuse that schema when it is created.
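
A rough sketch of the intended refactoring (illustrative only: DataSourceSketch, resolvedSchema and scanExistingFiles are hypothetical names, not the actual DataSource implementation in Spark). The idea is to resolve the schema lazily exactly once and let both call sites read the cached value:

import org.apache.spark.sql.types.StructType

// Illustrative sketch only -- the class and member names are hypothetical,
// not the real DataSource internals.
class DataSourceSketch(paths: Seq[String], userSpecifiedSchema: Option[StructType]) {

  // Resolved at most once: either taken from the user-specified schema or
  // inferred by scanning the existing files; the lazy val caches the result.
  lazy val resolvedSchema: StructType =
    userSpecifiedSchema.getOrElse(scanExistingFiles(paths))

  // Used when building the StreamingRelation in DataFrameReader.stream().
  def relationSchema: StructType = resolvedSchema

  // Used again when the streaming Source is created in DataSource.createSource();
  // this reuses the cached schema instead of scanning the files a second time.
  def sourceSchema: StructType = resolvedSchema

  // Stand-in for the expensive file listing + schema inference.
  private def scanExistingFiles(paths: Seq[String]): StructType = new StructType()
}

With this shape, only the first access to resolvedSchema triggers the file scan; subsequent accesses from either call site reuse the cached schema.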

  was:
When creating a file stream using sqlContext.write.stream(), existing files are scanned twice for finding the schema 
- Once, when creating a DataSource + StreamingRelation in the DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in DataSource.createSource()

Instead, the schema should be generated only once, at the time of creating the dataframe, and when the streaming source is created, it should just reuse that schame


> Refactor DataSource to ensure schema is inferred only once when creating a file stream
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-14832
>                 URL: https://issues.apache.org/jira/browse/SPARK-14832
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Streaming
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>
> When creating a file stream using sqlContext.read.stream(), the existing files are scanned twice to infer the schema:
> - Once, when creating the DataSource + StreamingRelation in DataFrameReader.stream()
> - Again, when creating the streaming Source from the DataSource, in DataSource.createSource()
> Instead, the schema should be inferred only once, when the DataFrame is created, and the streaming source should simply reuse that schema when it is created.



