Posted to dev@spark.apache.org by "Eskilson,Aleksander" <Al...@Cerner.com> on 2015/06/02 20:52:47 UTC

CSV Support in SparkR

Are there any intentions to provide first-class support for CSV files as one of the loadable file types in SparkR? Databricks' spark-csv package [1] has support for SQL, Python, and Java/Scala, and implements most of the arguments of R's read.table API [2], but currently there is no way to load CSV data in SparkR (1.4.0) besides separating our headers from the data, loading into an RDD, splitting on our delimiter, and then converting to a SparkR DataFrame with a vector of the column names gathered from the header.

Regards,
Alek Eskilson

[1] -- https://github.com/databricks/spark-csv
[2] -- http://www.inside-r.org/r-doc/utils/read.table
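
A minimal sketch of the workaround we're using (names are hypothetical; the RDD functions are private in SparkR 1.4, hence the ::: operator, and passing a plain vector of column names as the schema is an assumption):

# Read raw text, peel off the header, split on the delimiter, then
# build a DataFrame (illustrative only; these helpers are not public API)
lines <- SparkR:::textFile(sc, "./data.csv")
header <- strsplit(SparkR:::take(lines, 1L)[[1]], ",")[[1]]
rows <- SparkR:::filterRDD(lines, function(line) line != paste(header, collapse = ","))
parsed <- SparkR:::lapply(rows, function(line) as.list(strsplit(line, ",")[[1]]))
df <- createDataFrame(sqlContext, parsed, schema = header)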

CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

Re: CSV Support in SparkR

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Thanks for testing. We should probably include a section for this in the
SparkR programming guide given how popular CSV files are in R. Feel free to
open a PR for that if you get a chance.

Shivaram


Re: CSV Support in SparkR

Posted by "Eskilson,Aleksander" <Al...@Cerner.com>.
Seems to work great in the master build. It’s really good to have this functionality.

Regards,
Alek Eskilson


Re: CSV Support in SparkR

Posted by "Eskilson,Aleksander" <Al...@Cerner.com>.
Ah, alright, cool. I’ll rebuild and let you know.

Thanks again,
Alek


Re: CSV Support in SparkR

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
There was a bug in the SparkContext creation that I fixed yesterday.
https://github.com/apache/spark/commit/6b44278ef7cd2a278dfa67e8393ef30775c72726


If you build from master it should be fixed. Also, I think we might have an rc4, which should include this fix.

Thanks
Shivaram
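
For anyone following along, a rebuild would look roughly like this (a sketch; these are the standard steps for the Spark source tree, not taken from this thread):

# Pull master (which has the SparkContext fix) and rebuild
git checkout master && git pull origin master
build/mvn -DskipTests clean package
R/install-dev.sh   # rebuild the SparkR R package against the new jars
./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3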


Re: CSV Support in SparkR

Posted by "Eskilson,Aleksander" <Al...@Cerner.com>.
Hey, that’s pretty convenient. Unfortunately, although the package seems to pull fine into the session, I’m getting class not found exceptions with:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0: java.lang.ClassNotFoundException: com.databricks.spark.csv.CsvRelation$anonfun$buildScan$1

That smells like a path issue to me. I made sure the ivy repo was part of my PATH, but functions like showDF() still fail with that error. Did I miss a setting, or should the package inclusion in the sparkR invocation load that in?

I’ve run
df <- read.df(sqlCtx, "./data.csv", "com.databricks.spark.csv", header="true", delimiter="|")
showDF(df, 10)

(my data is pipe-delimited, and the default SQL context is sqlCtx)
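
One fallback I may try is passing the downloaded jars explicitly instead of relying on --packages (a sketch; ~/.ivy2/jars is Spark's default cache location for --packages artifacts, and the exact jar file names are guesses):

# Put the cached spark-csv jar (and its commons-csv dependency) on the
# classpath directly; file names below are assumptions
./bin/sparkR --jars ~/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar,~/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar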

Thanks,
Alek


Re: CSV Support in SparkR

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
Hi Alek

As Burak said, you can already use the spark-csv package with SparkR in the 1.4
release. Right now I use it with something like this:

# Launch SparkR
./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
df <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")

You can also pass other spark-csv options as arguments to `read.df`. Let us
know if this works.
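
For example, reading pipe-delimited data with a non-default quote character (delimiter and quote are options from the spark-csv README; the file name is illustrative):

df <- read.df(sqlContext, "./data.csv", "com.databricks.spark.csv",
              header = "true", delimiter = "|", quote = "'")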

Thanks
Shivaram



Re: CSV Support in SparkR

Posted by Burak Yavuz <br...@gmail.com>.
Hi,

cc'ing Shivaram here, because he worked on this yesterday.

If I'm not mistaken, you can use the following workflow:
```./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3```

and then

```df <- read.df(sqlContext, "/data", "csv", header = "true")```

Best,
Burak
