Posted to dev@spark.apache.org by "Eskilson,Aleksander" <Al...@Cerner.com> on 2015/06/03 16:51:50 UTC

SparkR DataFrame Column Casts esp. from CSV Files

It appears that casting columns remains a bit of a trick in Spark’s DataFrames. This is an issue because tools like spark-csv set column types to String by default and do not attempt to infer types. Although spark-csv supports specifying types for columns in its options, it’s not clear how that might be used from SparkR (when loading the spark-csv package into the R session).
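For reference, a minimal sketch of what the default load looks like from SparkR (the file path is hypothetical; it assumes the spark-csv package is on the classpath and a sqlContext exists):

df <- read.df(sqlContext, "data/flights.csv", source = "com.databricks.spark.csv", header = "true")
printSchema(df)   # every column comes back as string by default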

Looking at column.R, we can cast a column to a different data type with the cast function [1], but notably this is not a mutator: it returns a Column object rather than a DataFrame. It appears the cast can only be ‘applied’ by using withColumn() or mutate() (an alias for withColumn).

The other way to cast with Spark DataFrames is to write UDFs that operate on a column’s values and return coerced values. It looks like SparkR doesn’t have UDFs just yet [2], but they seem necessary for a natural one-off column cast in R, something like

df.col1toInt <- withColumn(df, "intCol1", udf(df$col1, function(x) as.numeric(x)))

(where col1 was originally ‘character’ type)

Currently it seems one has to:
df.col1cast <- cast(df$col1, "int")
df.col1toInt <- withColumn(df, "intCol1", df.col1cast)

If we wanted just the casted columns and not the original columns from the data frame, we’d still have to do a select. There was a conversation about CSV files on this list just yesterday: types are already problematic, and CSVs are a very common data source in R, even at scale.
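For two columns, the whole dance (cast, withColumn, then a select to drop the originals) looks something like this; column names are hypothetical:

df.casted <- withColumn(df, "intCol1", cast(df$col1, "int"))
df.casted <- withColumn(df.casted, "dblCol2", cast(df.casted$col2, "double"))
df.casted <- select(df.casted, "intCol1", "dblCol2")   # keep only the casted columns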

But only being able to coerce one column at a time is really unwieldy. Can the current spark-csv SQL API for specifying types [3] be extended to SparkR? And are there any thoughts on implementing some kind of type inference, perhaps based on sampling some number of rows (an approach I’ve seen elsewhere)? R’s read.csv() and read.delim() infer types from the whole file. Something that achieves comparable functionality, whether by explicit type definitions or by sampling, will probably be necessary to work with CSV files that have enough columns to merit using R at Spark’s scale.
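One rough sketch of the sampling idea, using plain R to infer classes from the first rows and handing the result to spark-csv as a schema (the file path and the class-to-type mapping are only illustrative):

sample_df <- read.csv("data/flights.csv", nrows = 1000, stringsAsFactors = FALSE)
r_classes <- sapply(sample_df, class)
# partial mapping from R classes to Spark SQL type names
to_spark  <- c(integer = "integer", numeric = "double", character = "string", logical = "boolean")
fields    <- mapply(structField, names(r_classes), to_spark[r_classes], SIMPLIFY = FALSE)
schema    <- do.call(structType, fields)
df <- read.df(sqlContext, "data/flights.csv", source = "com.databricks.spark.csv", schema = schema, header = "true")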

Regards,
Alek Eskilson

[1] - https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
[2] - https://issues.apache.org/jira/browse/SPARK-6817
[3] - https://github.com/databricks/spark-csv#sql-api


Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by Reynold Xin <rx...@databricks.com>.
I think Hossein does want to implement schema inference for CSV -- then
it'd be easy.

Another way you can do this is to use an R data.frame/data.table to read the CSV
files in, and then convert it into a Spark DataFrame. Not going to be
scalable, but it could work.
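A minimal sketch of that approach, assuming the file is small enough to fit in the driver's memory (path hypothetical):

local_df <- read.csv("data/flights.csv", stringsAsFactors = FALSE)   # types inferred by base R
df <- createDataFrame(sqlContext, local_df)
printSchema(df)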


Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by "Eskilson,Aleksander" <Al...@Cerner.com>.
Hi Shivaram,

As far as the databricks spark-csv API shows, it seems there’s currently only support for explicitly defining column types. In JSON we have nicely typed fields, but in CSVs, all bets are off. In the SQL version of the API, it appears you specify the column types when you create the table you’re populating with CSV data.

Thanks for the clarification on individual column casting, I was missing the more obvious syntax.

I’ll file a JIRA for resetting the schema after loading a DF.

Thanks,
Alek



Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I created https://issues.apache.org/jira/browse/SPARK-8085 for this.


Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Hmm - the schema = myschema argument doesn't seem to work in SparkR in my simple
local test. I'm filing a JIRA for this now.


Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by "Eskilson,Aleksander" <Al...@Cerner.com>.
Neat, thanks for the info, Hossein. My use case was just to reset the schema for a CSV dataset, but if either (a) I can specify it at load, or (b) it will be inferred in the future, I’ll likely not need to cast columns, much less reset the whole schema. I’ll still file a JIRA for the capability, but with lower priority.

—Alek




Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by Hossein Falaki <ho...@databricks.com>.
Yes, spark-csv does not infer types yet, but it is planned to be implemented soon.

To work around the current limitations (of spark-csv and SparkR), you can specify the schema in read.df() to get your desired types from spark-csv. For example:

myschema <- structType(structField("id", "integer"), structField("name", "string"), structField("location", "string"))
df <- read.df(sqlContext, "path/to/file.csv", source = "com.databricks.spark.csv", schema = myschema)
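To confirm the resulting types, one could then inspect the schema:

printSchema(df)
dtypes(df)   # column name / type pairs as a plain R list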

—Hossein



Re: SparkR DataFrame Column Casts esp. from CSV Files

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
cc Hossein who knows more about the spark-csv options

You are right that the default CSV reader options end up creating all
columns as string. I know that the JSON reader infers the schema [1], but I
don't know if the CSV reader has any options to do that. Regarding the
SparkR syntax to cast columns, I think there is a simpler way to do it by
just assigning to the same column name. For example, I have a flights
DataFrame with the `year` column typed as string. To cast it to int I just
use

flights$year <- cast(flights$year, "int")

Now the dataframe has the same number of columns as before and you don't
need a selection.
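For a handful of columns the same pattern can simply be repeated (column names hypothetical):

flights$year      <- cast(flights$year, "int")
flights$month     <- cast(flights$month, "int")
flights$dep_delay <- cast(flights$dep_delay, "double")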

However, this still doesn't address the part about casting multiple columns
-- could you file a new JIRA to track the need for casting multiple columns,
or rather for being able to set the schema after loading a DF?

Thanks
Shivaram

[1]
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

