Posted to user@spark.apache.org by Everett Anderson <ev...@nuna.com.INVALID> on 2016/06/17 19:38:04 UTC

Best way to go from RDD to DataFrame of StringType columns

Hi,

I have a system with files in a variety of non-standard input formats,
though they're generally flat text files. I'd like to dynamically create
DataFrames of string columns.

What's the best way to go from an RDD<String> to a DataFrame of StringType
columns?

My current plan (with a rough code sketch after the list) is:

   - Call map() on the RDD<String> with a function to split the String into
   columns and call RowFactory.create() with the resulting array, creating an
   RDD<Row>
   - Construct a StructType schema using column names and StringType
   - Call SQLContext.createDataFrame(RDD, schema) to create the result
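
In Scala, a minimal sketch of those steps (the column names and the pipe
delimiter are invented for illustration; rdd: RDD[String] and sqlContext are
assumed to be in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Invented column names -- in practice these would come from per-format metadata.
val columnNames = Seq("col1", "col2", "col3")
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

// Split each line (on a made-up pipe delimiter); the -1 limit keeps trailing
// empty fields. Row.fromSeq is the Scala analogue of RowFactory.create().
val rows = rdd.map(line => Row.fromSeq(line.split("\\|", -1).toSeq))

val df = sqlContext.createDataFrame(rows, schema)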

Does that make sense?

I looked through the spark-csv package a little and noticed that it's using
baseRelationToDataFrame(), but BaseRelation looks like it might be a
restricted developer API. Anyone know if it's recommended for use?

Thanks!

- Everett

Re: Best way to go from RDD to DataFrame of StringType columns

Posted by Everett Anderson <ev...@nuna.com.INVALID>.
On Fri, Jun 17, 2016 at 1:17 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> OK, a bit of a challenge.
>
> Have you tried the Databricks spark-csv package? It can read compressed
> files and might work here:
>
> import org.apache.spark.sql.functions.col
> import sqlContext.implicits._  // for toDF; implicit in spark-shell
>
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("inferSchema", "true")
>   .option("header", "true")
>   .load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
>
> case class Accounts(TransactionDate: String, TransactionType: String,
>   Description: String, Value: Double, Balance: Double,
>   AccountName: String, AccountNumber: String)
>
> // Map the columns to named case class fields
> val a = df.filter(col("Date") > "").map(p =>
>   Accounts(p(0).toString, p(1).toString, p(2).toString,
>     p(3).toString.toDouble, p(4).toString.toDouble,
>     p(5).toString, p(6).toString))
>
> // Register the result as a Spark temporary table
> a.toDF.registerTempTable("tmp")
>

Yes, I looked at their spark-csv package -- it'd be great for CSV (or even
a large swath of delimited file formats). In some cases, though, I have file
formats that aren't delimited in a way compatible with it, so I was rolling
my own string lines => DataFrames conversion.

Also, the record formats are arbitrary, and I don't want to restrict them to
a compile-time value class, hence the need to create the schema manually.
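
Something like this, for example -- a minimal sketch where the column names,
delimiter, and null token are all invented, runtime-supplied values
(rdd: RDD[String] and sqlContext assumed in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical per-format metadata, known only at runtime.
val columnNames = Seq("acct", "date", "amount")
val delimiter = "\u0001"  // some exotic delimiter
val nullToken = "\\N"     // an odd null representation, Hive-style

val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

val rows = rdd.map { line =>
  // Pattern.quote so exotic delimiters aren't interpreted as regexes;
  // the -1 limit keeps trailing empty fields.
  val fields = line.split(java.util.regex.Pattern.quote(delimiter), -1)
  // Translate the custom null representation into real nulls.
  Row.fromSeq(fields.map(f => if (f == nullToken) null else f).toSeq)
}

val df = sqlContext.createDataFrame(rows, schema)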




>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 June 2016 at 21:02, Everett Anderson <ev...@nuna.com> wrote:
>
>>
>>
>> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Are these mainly in CSV format?
>>>
>>
>> Alas, no -- lots of different formats. Many are fixed-width files, where
>> I have outside information about which byte ranges correspond to which
>> columns. Some have odd null representations or non-comma delimiters (though
>> many of those cases might fit within the configurability of the spark-csv
>> package).
>>
>>
>>
>>
>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 June 2016 at 20:38, Everett Anderson <ev...@nuna.com.invalid>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a system with files in a variety of non-standard input formats,
>>>> though they're generally flat text files. I'd like to dynamically create
>>>> DataFrames of string columns.
>>>>
>>>> What's the best way to go from an RDD<String> to a DataFrame of
>>>> StringType columns?
>>>>
>>>> My current plan is
>>>>
>>>>    - Call map() on the RDD<String> with a function to split the String
>>>>    into columns and call RowFactory.create() with the resulting array,
>>>>    creating an RDD<Row>
>>>>    - Construct a StructType schema using column names and StringType
>>>>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>>
>>>> Does that make sense?
>>>>
>>>> I looked through the spark-csv package a little and noticed that it's
>>>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>>>> restricted developer API. Anyone know if it's recommended for use?
>>>>
>>>> Thanks!
>>>>
>>>> - Everett
>>>>
>>>>
>>>
>>
>

Re: Best way to go from RDD to DataFrame of StringType columns

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, a bit of a challenge.

Have you tried the Databricks spark-csv package? It can read compressed
files and might work here:

import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // for toDF; implicit in spark-shell

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")

case class Accounts(TransactionDate: String, TransactionType: String,
  Description: String, Value: Double, Balance: Double,
  AccountName: String, AccountNumber: String)

// Map the columns to named case class fields
val a = df.filter(col("Date") > "").map(p =>
  Accounts(p(0).toString, p(1).toString, p(2).toString,
    p(3).toString.toDouble, p(4).toString.toDouble,
    p(5).toString, p(6).toString))

// Register the result as a Spark temporary table
a.toDF.registerTempTable("tmp")



HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 June 2016 at 21:02, Everett Anderson <ev...@nuna.com> wrote:

>
>
> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Are these mainly in CSV format?
>>
>
> Alas, no -- lots of different formats. Many are fixed-width files, where I
> have outside information about which byte ranges correspond to which
> columns. Some have odd null representations or non-comma delimiters (though
> many of those cases might fit within the configurability of the spark-csv
> package).
>
>
>
>
>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 June 2016 at 20:38, Everett Anderson <ev...@nuna.com.invalid>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a system with files in a variety of non-standard input formats,
>>> though they're generally flat text files. I'd like to dynamically create
>>> DataFrames of string columns.
>>>
>>> What's the best way to go from an RDD<String> to a DataFrame of
>>> StringType columns?
>>>
>>> My current plan is
>>>
>>>    - Call map() on the RDD<String> with a function to split the String
>>>    into columns and call RowFactory.create() with the resulting array,
>>>    creating an RDD<Row>
>>>    - Construct a StructType schema using column names and StringType
>>>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>
>>> Does that make sense?
>>>
>>> I looked through the spark-csv package a little and noticed that it's
>>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>>> restricted developer API. Anyone know if it's recommended for use?
>>>
>>> Thanks!
>>>
>>> - Everett
>>>
>>>
>>
>

Re: Best way to go from RDD to DataFrame of StringType columns

Posted by Everett Anderson <ev...@nuna.com.INVALID>.
On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> wrote:

> Are these mainly in CSV format?
>

Alas, no -- lots of different formats. Many are fixed-width files, where I
have outside information about which byte ranges correspond to which
columns. Some have odd null representations or non-comma delimiters (though
many of those cases might fit within the configurability of the spark-csv
package).
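
For the fixed-width case, the map step would just slice byte ranges instead
of splitting. A rough sketch (the field specs here are invented stand-ins
for that outside metadata; rdd: RDD[String] and sqlContext assumed in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical (name, startOffset, endOffset) specs for each column.
val fieldSpecs = Seq(("account_id", 0, 10), ("balance", 10, 22), ("status", 22, 23))

val schema = StructType(fieldSpecs.map { case (name, _, _) =>
  StructField(name, StringType, nullable = true)
})

// Slice each line into trimmed string columns, tolerating short lines.
val rows = rdd.map { line =>
  Row.fromSeq(fieldSpecs.map { case (_, start, end) =>
    if (start >= line.length) null
    else line.substring(start, math.min(end, line.length)).trim
  })
}

val df = sqlContext.createDataFrame(rows, schema)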





>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 June 2016 at 20:38, Everett Anderson <ev...@nuna.com.invalid>
> wrote:
>
>> Hi,
>>
>> I have a system with files in a variety of non-standard input formats,
>> though they're generally flat text files. I'd like to dynamically create
>> DataFrames of string columns.
>>
>> What's the best way to go from an RDD<String> to a DataFrame of StringType
>> columns?
>>
>> My current plan is
>>
>>    - Call map() on the RDD<String> with a function to split the String
>>    into columns and call RowFactory.create() with the resulting array,
>>    creating an RDD<Row>
>>    - Construct a StructType schema using column names and StringType
>>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>
>> Does that make sense?
>>
>> I looked through the spark-csv package a little and noticed that it's
>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>> restricted developer API. Anyone know if it's recommended for use?
>>
>> Thanks!
>>
>> - Everett
>>
>>
>

Re: Best way to go from RDD to DataFrame of StringType columns

Posted by Mich Talebzadeh <mi...@gmail.com>.
Are these mainly in CSV format?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 June 2016 at 20:38, Everett Anderson <ev...@nuna.com.invalid> wrote:

> Hi,
>
> I have a system with files in a variety of non-standard input formats,
> though they're generally flat text files. I'd like to dynamically create
> DataFrames of string columns.
>
> What's the best way to go from an RDD<String> to a DataFrame of StringType
> columns?
>
> My current plan is
>
>    - Call map() on the RDD<String> with a function to split the String
>    into columns and call RowFactory.create() with the resulting array,
>    creating an RDD<Row>
>    - Construct a StructType schema using column names and StringType
>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>
> Does that make sense?
>
> I looked through the spark-csv package a little and noticed that it's
> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
> restricted developer API. Anyone know if it's recommended for use?
>
> Thanks!
>
> - Everett
>
>

Re: Best way to go from RDD to DataFrame of StringType columns

Posted by Jason <Ja...@jasonknight.us>.
We take exactly the approach you proposed for converting horrible text
formats (VCF, in the bioinformatics domain) into DataFrames. This involves
creating the schema dynamically from the header of the file, too.

It's simple and easy, but if you need higher performance you might need to
look into custom Dataset encoders, though I'm not sure what kind of gain
(if any) you'd get from that approach.
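
For reference, the header-driven part looks roughly like this (the path and
tab delimiter are invented here; sc and sqlContext are assumed to be in
scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val lines = sc.textFile("hdfs:///path/to/data.tsv")

// Derive the schema from the file's first (header) line.
val header = lines.first()
val schema = StructType(header.split("\t", -1).map(name =>
  StructField(name, StringType, nullable = true)))

// Drop the header line, then split the rest and wrap each in a Row.
val rows = lines
  .filter(_ != header)
  .map(line => Row.fromSeq(line.split("\t", -1).toSeq))

val df = sqlContext.createDataFrame(rows, schema)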

Jason

On Fri, Jun 17, 2016, 12:38 PM Everett Anderson <ev...@nuna.com.invalid>
wrote:

> Hi,
>
> I have a system with files in a variety of non-standard input formats,
> though they're generally flat text files. I'd like to dynamically create
> DataFrames of string columns.
>
> What's the best way to go from an RDD<String> to a DataFrame of StringType
> columns?
>
> My current plan is
>
>    - Call map() on the RDD<String> with a function to split the String
>    into columns and call RowFactory.create() with the resulting array,
>    creating an RDD<Row>
>    - Construct a StructType schema using column names and StringType
>    - Call SQLContext.createDataFrame(RDD, schema) to create the result
>
> Does that make sense?
>
> I looked through the spark-csv package a little and noticed that it's
> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
> restricted developer API. Anyone know if it's recommended for use?
>
> Thanks!
>
> - Everett
>
>