Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2016/03/30 16:50:25 UTC

[jira] [Created] (SPARK-14274) Replaces inferSchema with prepareRead to collect necessary global information

Cheng Lian created SPARK-14274:
----------------------------------

             Summary: Replaces inferSchema with prepareRead to collect necessary global information
                 Key: SPARK-14274
                 URL: https://issues.apache.org/jira/browse/SPARK-14274
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Cheng Lian
            Assignee: Cheng Lian


One problem with our newly introduced {{FileFormat.buildReader()}} method is that it only sees pieces of input files. On the other hand, data sources like CSV and LibSVM require some sort of global information:

- CSV: the content of the header line if the {{header}} option is set to true, so that we can filter out header lines within each input file. This is considered global information because the header may appear in the middle of a file, after blocks of comments and empty lines, although this is a rare, contrived corner case.
- LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to infer the total number of features in order to construct the resulting {{LabeledPoint}}s.
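
To illustrate why the LibSVM case needs the whole dataset, here is a minimal sketch in plain Scala. The helper {{inferNumFeatures}} is hypothetical, not Spark's implementation, and it assumes each line has the form {{label index:value ...}} with one-based indices.

```scala
// Hypothetical helper (not Spark's code): infers the number of features as the
// largest one-based feature index found in LibSVM-style lines.
def inferNumFeatures(lines: Iterator[String]): Int =
  lines
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#"))
    .map { line =>
      // Drop the label, then keep the index part of each "index:value" token.
      line.split("\\s+").drop(1)
        .map(token => token.takeWhile(_ != ':').toInt)
        .foldLeft(0)(_ max _)
    }
    .foldLeft(0)(_ max _)

val piece1 = Seq("1 1:0.5 3:1.2")
val piece2 = Seq("0 2:0.1 7:4.0")

// A reader that sees only the first piece would infer too few features.
val fromOnePiece = inferNumFeatures(piece1.iterator)
val fromWholeDataset = inferNumFeatures((piece1 ++ piece2).iterator)
```

Here a reader that only sees {{piece1}} infers 3 features, while a scan over both pieces infers 7, so the feature count has to be determined before the per-piece readers are built.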

Unfortunately, with our current API, this kind of global information can't be gathered.

The solution proposed here is to add a {{prepareRead}} method that accepts the same arguments as {{inferSchema}} but returns a {{ReadContext}}. The {{ReadContext}} contains an {{Option\[StructType\]}} for the inferred schema and a {{Map\[String, Any\]}} for any gathered global information, and is then passed to {{buildReader()}}. By default, {{prepareRead}} simply calls {{inferSchema}}; the inferred schema itself can be considered a sort of global information.
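
The proposal could be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual Spark API: {{StructType}} and {{StructField}} are stand-in case classes, "files" are plain sequences of lines, the field names {{schema}} and {{globalInfo}}, the {{"header"}} map key, and {{ToyCsvFormat}} are all assumptions made for the example.

```scala
// Stand-ins for Spark's types so the sketch compiles without Spark on the classpath.
final case class StructField(name: String, dataType: String)
final case class StructType(fields: Seq[StructField])

// Assumed shape of the proposed ReadContext: inferred schema plus free-form global info.
final case class ReadContext(
    schema: Option[StructType],
    globalInfo: Map[String, Any] = Map.empty)

// Simplified FileFormat: each "file" is just its sequence of lines.
trait FileFormat {
  def inferSchema(options: Map[String, String], files: Seq[Seq[String]]): Option[StructType]

  // Default behaviour from the proposal: the schema is the only global information.
  def prepareRead(options: Map[String, String], files: Seq[Seq[String]]): ReadContext =
    ReadContext(inferSchema(options, files))

  // Returns a function that reads one piece of input at a time.
  def buildReader(context: ReadContext): Iterator[String] => Iterator[String]
}

object ToyCsvFormat extends FileFormat {
  private def isData(line: String): Boolean = {
    val trimmed = line.trim
    trimmed.nonEmpty && !trimmed.startsWith("#")
  }

  // The first non-comment, non-empty line across all files.
  private def firstDataLine(files: Seq[Seq[String]]): Option[String] =
    files.iterator.flatten.find(isData)

  // Toy inference: one string column per comma-separated field, named by position.
  def inferSchema(options: Map[String, String], files: Seq[Seq[String]]): Option[StructType] =
    firstDataLine(files).map { line =>
      StructType(line.split(",").indices.map(i => StructField(s"c$i", "string")))
    }

  // Gathers the header once, up front, when the header option is enabled.
  override def prepareRead(options: Map[String, String], files: Seq[Seq[String]]): ReadContext = {
    val info: Map[String, Any] =
      if (options.get("header").contains("true"))
        firstDataLine(files).map(h => Map[String, Any]("header" -> h)).getOrElse(Map.empty)
      else Map.empty
    ReadContext(inferSchema(options, files), info)
  }

  // Each reader sees only one piece, but can still drop lines equal to the header.
  def buildReader(context: ReadContext): Iterator[String] => Iterator[String] = {
    val header = context.globalInfo.get("header")
    lines => lines.filter(line => isData(line) && !header.contains(line))
  }
}

val file1 = Seq("# comment", "", "id,name", "1,a")
val file2 = Seq("id,name", "2,b")
val ctx = ToyCsvFormat.prepareRead(Map("header" -> "true"), Seq(file1, file2))
val read = ToyCsvFormat.buildReader(ctx)
val rows = (read(file1.iterator) ++ read(file2.iterator)).toList
```

In this sketch the header is found once in {{prepareRead}}, even though it sits after a comment and an empty line in {{file1}}, and every reader built from the resulting {{ReadContext}} can filter it out without rescanning the input.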



