Posted to user@spark.apache.org by Devesh Raj Singh <ra...@gmail.com> on 2016/01/25 08:05:46 UTC

NA value handling in sparkR

Hi,

I have applied the following code to the airquality dataset available in R,
which has some missing values. I want to omit the rows that have NAs.

library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')

sc <- sparkR.init("local",sparkHome =
"/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")

sqlContext <- sparkRSQL.init(sc)

path<-"/Users/devesh/work/airquality/"

aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
header="true", inferSchema="true")

head(dropna(aq,how="any"))

I am getting the output as

  Ozone Solar_R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

The NAs still exist in the output. Am I missing something here?

-- 
Warm regards,
Devesh.

Re: NA value handling in sparkR

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hm.. As far as I remember, you can set the value to treat as null with the
*nullValue* option. I am hitting network issues with GitHub and can't check
this right now, but please try that option as described in
https://github.com/databricks/spark-csv.
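If that option works as documented in the spark-csv README, the read would look something like this (a sketch only; the `nullValue` parameter name is taken from the spark-csv 1.x documentation and is not verified here):

```r
# Sketch: pass nullValue so spark-csv treats the literal string "NA"
# as a real NULL instead of a string value (assumes spark-csv 1.x and
# an initialized sqlContext, as in the original post).
aq <- read.df(sqlContext, path,
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true",
              nullValue = "NA")

# With real NULLs in place, dropna() should now remove incomplete rows.
head(dropna(aq, how = "any"))
```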

2016-01-28 0:55 GMT+09:00 Felix Cheung <fe...@hotmail.com>:

> That's correct - and that's because spark-csv, as a Spark package, is not
> aware of R's notion of NA and interprets it as a string value.
>
> On the other hand, R's native NA is converted to NULL on Spark when creating
> a Spark DataFrame from an R data.frame.
> https://eradiating.wordpress.com/2016/01/04/whats-new-in-sparkr-1-6-0/
>
>
>
> _____________________________
> From: Devesh Raj Singh <ra...@gmail.com>
> Sent: Wednesday, January 27, 2016 3:19 AM
> Subject: Re: NA value handling in sparkR
> To: Deborah Siegel <de...@gmail.com>
> Cc: <us...@spark.apache.org>
>
>
>
> Hi,
>
> While dealing with missing values in R and SparkR I observed the
> following. Please tell me whether I am right or wrong.
>
> Missing values in native R are represented by the logical constant NA.
> SparkR DataFrames represent missing values with NULL. If you use
> createDataFrame() to turn a local R data.frame into a distributed SparkR
> DataFrame, SparkR will automatically convert NA to NULL.
>
> However, if you create a SparkR DataFrame by reading data from a file
> with read.df(), you may have the string "NA" rather than the R logical
> constant NA. The string "NA" is not automatically converted to NULL.
>
> On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <de...@gmail.com>
> wrote:
>
>> Maybe not ideal, but since read.df infers columns containing "NA" in the
>> csv as strings, one could filter them rather than using
>> dropna().
>>
>> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
>> head(filtered_aq)
>>
>> Perhaps it would be better to have an option for read.df to convert any
>> "NA" it encounters into null types, like createDataFrame does for <NA>, and
>> then one would be able to use dropna() etc.
>>
>>
>>
>> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.devesh99@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> Yes you are right.
>>>
>>> I think the problem is with reading of csv files. read.df is not
>>> considering NAs in the CSV file
>>>
>>> So what would be a workable solution in dealing with NAs in csv files?
>>>
>>>
>>>
>>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <
>>> deborah.siegel@gmail.com> wrote:
>>>
>>>> Hi Devesh,
>>>>
>>>> I'm not certain why that's happening, and it looks like it doesn't
>>>> happen if you use createDataFrame directly:
>>>> aq <- createDataFrame(sqlContext,airquality)
>>>> head(dropna(aq,how="any"))
>>>>
>>>> If I had to guess: dropna(), I believe, drops null values. It's
>>>> possible that createDataFrame converts R's <NA> values to null, so
>>>> dropna() works with that, whereas read.df() does not convert them to
>>>> null, as those are most likely interpreted as strings when they come in
>>>> from the csv. Just a guess, can anyone confirm?
>>>>
>>>> Deb
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>>>> raj.devesh99@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have applied the following code on airquality dataset available in R
>>>>> , which has some missing values. I want to omit the rows which has NAs
>>>>>
>>>>> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
>>>>> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>>
>>>>> sc <- sparkR.init("local",sparkHome =
>>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>>
>>>>> sqlContext <- sparkRSQL.init(sc)
>>>>>
>>>>> path<-"/Users/devesh/work/airquality/"
>>>>>
>>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>>> header="true", inferSchema="true")
>>>>>
>>>>> head(dropna(aq,how="any"))
>>>>>
>>>>> I am getting the output as
>>>>>
>>>>> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72
>>>>> 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28
>>>>> NA 14.9 66 5 6
>>>>>
>>>>> The NAs still exist in the output. Am I missing something here?
>>>>>
>>>>> --
>>>>> Warm regards,
>>>>> Devesh.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>
>
> --
> Warm regards,
> Devesh.
>
>
>

Re: NA value handling in sparkR

Posted by Felix Cheung <fe...@hotmail.com>.
That's correct - and that's because spark-csv, as a Spark package, is not aware of R's notion of NA and interprets it as a string value.
On the other hand, R's native NA is converted to NULL on Spark when creating a Spark DataFrame from an R data.frame. https://eradiating.wordpress.com/2016/01/04/whats-new-in-sparkr-1-6-0/
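A minimal sketch of that difference, using the airquality data.frame that ships with R (assumes a running SparkR 1.5/1.6 session with sqlContext initialized, and that aq_csv was loaded via read.df with spark-csv as in the original post):

```r
# R-native NA becomes NULL when a local data.frame is shipped to Spark,
# so dropna() can actually drop the incomplete rows:
aq_local <- createDataFrame(sqlContext, airquality)
head(dropna(aq_local, how = "any"))

# The same data read from csv keeps "NA" as a plain string, and the
# affected columns are inferred as string type rather than numeric:
printSchema(aq_csv)
```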



    _____________________________
From: Devesh Raj Singh <ra...@gmail.com>
Sent: Wednesday, January 27, 2016 3:19 AM
Subject: Re: NA value handling in sparkR
To: Deborah Siegel <de...@gmail.com>
Cc: <us...@spark.apache.org>


Hi,

While dealing with missing values in R and SparkR I observed the following. Please tell me whether I am right or wrong.

Missing values in native R are represented by the logical constant NA. SparkR DataFrames represent missing values with NULL. If you use createDataFrame() to turn a local R data.frame into a distributed SparkR DataFrame, SparkR will automatically convert NA to NULL.

However, if you create a SparkR DataFrame by reading data from a file with read.df(), you may have the string "NA" rather than the R logical constant NA. The string "NA" is not automatically converted to NULL.

On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <de...@gmail.com> wrote:

> Maybe not ideal, but since read.df infers columns containing "NA" in the
> csv as strings, one could filter them rather than using dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for <NA>,
> and then one would be able to use dropna() etc.
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <ra...@gmail.com> wrote:
>
>> Hi,
>>
>> Yes you are right.
>>
>> I think the problem is with the reading of csv files. read.df is not
>> treating NAs in the CSV file as missing values.
>>
>> So what would be a workable solution for dealing with NAs in csv files?
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <de...@gmail.com> wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>> aq <- createDataFrame(sqlContext,airquality)
>>> head(dropna(aq,how="any"))
>>>
>>> If I had to guess: dropna(), I believe, drops null values. It's
>>> possible that createDataFrame converts R's <NA> values to null, so
>>> dropna() works with that, whereas read.df() does not convert them to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv. Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <ra...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code to the airquality dataset available
>>>> in R, which has some missing values. I want to omit the rows that have
>>>> NAs.
>>>>
>>>> library(SparkR)
>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local", sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path <- "/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext, path, source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
>>>>
>>>> head(dropna(aq, how = "any"))
>>>>
>>>> I am getting the output as
>>>>
>>>>   Ozone Solar_R Wind Temp Month Day
>>>> 1    41     190  7.4   67     5   1
>>>> 2    36     118  8.0   72     5   2
>>>> 3    12     149 12.6   74     5   3
>>>> 4    18     313 11.5   62     5   4
>>>> 5    NA      NA 14.3   56     5   5
>>>> 6    28      NA 14.9   66     5   6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
>>>
>>
>> --
>> Warm regards,
>> Devesh.
>

--
Warm regards,
Devesh.

Re: NA value handling in sparkR

Posted by Devesh Raj Singh <ra...@gmail.com>.
Hi,

While dealing with missing values in R and SparkR I observed the
following. Please tell me whether I am right or wrong.


Missing values in native R are represented by the logical constant NA.
SparkR DataFrames represent missing values with NULL. If you use
createDataFrame() to turn a local R data.frame into a distributed SparkR
DataFrame, SparkR will automatically convert NA to NULL.

However, if you create a SparkR DataFrame by reading data from a file
with read.df(), you may have the string "NA" rather than the R logical
constant NA. The string "NA" is not automatically converted to NULL.

On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <de...@gmail.com>
wrote:

> Maybe not ideal, but since read.df is inferring all columns from the csv
> containing "NA" as type of strings, one could filter them rather than using
> dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for <NA>, and
> then one would be able to use dropna() etc.
>
>
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <ra...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Yes you are right.
>>
>> I think the problem is with reading of csv files. read.df is not
>> considering NAs in the CSV file
>>
>> So what would be a workable solution in dealing with NAs in csv files?
>>
>>
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.siegel@gmail.com
>> > wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>> aq <- createDataFrame(sqlContext,airquality)
>>> head(dropna(aq,how="any"))
>>>
>>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>>> its possible that createDataFrame converts R's <NA> values to null, so
>>> dropna() works with that. But perhaps read.df() does not convert R <NA>s to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv. Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>>> raj.devesh99@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code on airquality dataset available in R
>>>> , which has some missing values. I want to omit the rows which has NAs
>>>>
>>>> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
>>>> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local",sparkHome =
>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path<-"/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>> header="true", inferSchema="true")
>>>>
>>>> head(dropna(aq,how="any"))
>>>>
>>>> I am getting the output as
>>>>
>>>> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5
>>>> 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA
>>>> 14.9 66 5 6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
>>>>
>>>
>>>
>>
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>


-- 
Warm regards,
Devesh.

Re: NA value handling in sparkR

Posted by Devesh Raj Singh <ra...@gmail.com>.
Hi,

If we want to create dummy variables out of categorical columns for data
manipulation purposes, how would we do it in SparkR?

On Wednesday, January 27, 2016, Deborah Siegel <de...@gmail.com>
wrote:

> While fitting the currently available SparkR models, such as glm for
> linear and logistic regression, columns which contain strings are one-hot
> encoded behind the scenes as part of the parsing of the RFormula. Does
> that help, or did you have something else in mind?
>
>
>
>
>> Thank you so much for your mail. It is working.
>>       I have another small question in SparkR - can we create dummy
>> variables for categorical columns (like the "dummies" package in R)?
>> E.g. in the iris dataset, Species is a categorical column, so 3 dummy
>> variable columns like setosa and virginica would be created with 0 and 1
>> as values.
>
>
> On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <deborah.siegel@gmail.com>
> wrote:
>
>> Maybe not ideal, but since read.df is inferring all columns from the csv
>> containing "NA" as type of strings, one could filter them rather than using
>> dropna().
>>
>> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
>> head(filtered_aq)
>>
>> Perhaps it would be better to have an option for read.df to convert any
>> "NA" it encounters into null types, like createDataFrame does for <NA>, and
>> then one would be able to use dropna() etc.
>>
>>
>>
>> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.devesh99@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Yes you are right.
>>>
>>> I think the problem is with reading of csv files. read.df is not
>>> considering NAs in the CSV file
>>>
>>> So what would be a workable solution in dealing with NAs in csv files?
>>>
>>>
>>>
>>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <
>>> deborah.siegel@gmail.com> wrote:
>>>
>>>> Hi Devesh,
>>>>
>>>> I'm not certain why that's happening, and it looks like it doesn't
>>>> happen if you use createDataFrame directly:
>>>> aq <- createDataFrame(sqlContext,airquality)
>>>> head(dropna(aq,how="any"))
>>>>
>>>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>>>> its possible that createDataFrame converts R's <NA> values to null, so
>>>> dropna() works with that. But perhaps read.df() does not convert R <NA>s to
>>>> null, as those are most likely interpreted as strings when they come in
>>>> from the csv. Just a guess, can anyone confirm?
>>>>
>>>> Deb
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>>>> raj.devesh99@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have applied the following code on airquality dataset available in R
>>>>> , which has some missing values. I want to omit the rows which has NAs
>>>>>
>>>>> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
>>>>> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>>
>>>>> sc <- sparkR.init("local",sparkHome =
>>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>>
>>>>> sqlContext <- sparkRSQL.init(sc)
>>>>>
>>>>> path<-"/Users/devesh/work/airquality/"
>>>>>
>>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>>> header="true", inferSchema="true")
>>>>>
>>>>> head(dropna(aq,how="any"))
>>>>>
>>>>> I am getting the output as
>>>>>
>>>>> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72
>>>>> 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA
>>>>> 14.9 66 5 6
>>>>>
>>>>> The NAs still exist in the output. Am I missing something here?
>>>>>
>>>>> --
>>>>> Warm regards,
>>>>> Devesh.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>

-- 
Warm regards,
Devesh.

Re: NA value handling in sparkR

Posted by Deborah Siegel <de...@gmail.com>.
While fitting the currently available SparkR models, such as glm for linear
and logistic regression, columns which contain strings are one-hot encoded
behind the scenes as part of the parsing of the RFormula. Does that help,
or did you have something else in mind?
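As a sketch of what that looks like in SparkR 1.5/1.6 (note that the dots in iris's column names become underscores when the data.frame is shipped to Spark; which reference level the encoding drops is an RFormula implementation detail, not verified here):

```r
# Fit a SparkR glm on iris; the string column Species is one-hot
# encoded automatically by the RFormula machinery behind glm().
df <- createDataFrame(sqlContext, iris)
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df,
             family = "gaussian")
summary(model)  # the coefficient table includes one term per encoded Species level
```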




> Thank you so much for your mail. It is working.
>       I have another small question in SparkR - can we create dummy
> variables for categorical columns (like the "dummies" package in R)?
> E.g. in the iris dataset, Species is a categorical column, so 3 dummy
> variable columns like setosa and virginica would be created with 0 and 1
> as values.


On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <de...@gmail.com>
wrote:

> Maybe not ideal, but since read.df is inferring all columns from the csv
> containing "NA" as type of strings, one could filter them rather than using
> dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for <NA>, and
> then one would be able to use dropna() etc.
>
>
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <ra...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Yes you are right.
>>
>> I think the problem is with reading of csv files. read.df is not
>> considering NAs in the CSV file
>>
>> So what would be a workable solution in dealing with NAs in csv files?
>>
>>
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.siegel@gmail.com
>> > wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>> aq <- createDataFrame(sqlContext,airquality)
>>> head(dropna(aq,how="any"))
>>>
>>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>>> its possible that createDataFrame converts R's <NA> values to null, so
>>> dropna() works with that. But perhaps read.df() does not convert R <NA>s to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv. Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>>> raj.devesh99@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code on airquality dataset available in R
>>>> , which has some missing values. I want to omit the rows which has NAs
>>>>
>>>> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
>>>> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local",sparkHome =
>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path<-"/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>> header="true", inferSchema="true")
>>>>
>>>> head(dropna(aq,how="any"))
>>>>
>>>> I am getting the output as
>>>>
>>>> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5
>>>> 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA
>>>> 14.9 66 5 6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
>>>>
>>>
>>>
>>
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>

Re: NA value handling in sparkR

Posted by Deborah Siegel <de...@gmail.com>.
Maybe not ideal, but since read.df infers columns containing "NA" in the
csv as strings, one could filter them rather than using
dropna().

filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
head(filtered_aq)

Perhaps it would be better to have an option for read.df to convert any
"NA" it encounters into null types, like createDataFrame does for <NA>, and
then one would be able to use dropna() etc.
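Until such an option exists, another workaround is to rewrite the "NA" strings as real NULLs in Spark SQL and cast the affected columns back to numeric (a sketch only; it assumes Ozone and Solar_R are the columns that came in as strings, and that aq was loaded via read.df as above):

```r
# Turn the "NA" strings into real NULLs so dropna() behaves as expected.
registerTempTable(aq, "aq_tbl")
aq_fixed <- sql(sqlContext, "
  SELECT CAST(CASE WHEN Ozone   = 'NA' THEN NULL ELSE Ozone   END AS double) AS Ozone,
         CAST(CASE WHEN Solar_R = 'NA' THEN NULL ELSE Solar_R END AS double) AS Solar_R,
         Wind, Temp, Month, Day
  FROM aq_tbl")
head(dropna(aq_fixed, how = "any"))
```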



On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <ra...@gmail.com>
wrote:

> Hi,
>
> Yes you are right.
>
> I think the problem is with reading of csv files. read.df is not
> considering NAs in the CSV file
>
> So what would be a workable solution in dealing with NAs in csv files?
>
>
>
> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <de...@gmail.com>
> wrote:
>
>> Hi Devesh,
>>
>> I'm not certain why that's happening, and it looks like it doesn't happen
>> if you use createDataFrame directly:
>> aq <- createDataFrame(sqlContext,airquality)
>> head(dropna(aq,how="any"))
>>
>> If I had to guess.. dropna(), I believe, drops null values. I suppose its
>> possible that createDataFrame converts R's <NA> values to null, so dropna()
>> works with that. But perhaps read.df() does not convert R <NA>s to null, as
>> those are most likely interpreted as strings when they come in from the
>> csv. Just a guess, can anyone confirm?
>>
>> Deb
>>
>>
>>
>>
>>
>>
>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>> raj.devesh99@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have applied the following code on airquality dataset available in R ,
>>> which has some missing values. I want to omit the rows which has NAs
>>>
>>> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
>>> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>
>>> sc <- sparkR.init("local",sparkHome =
>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>
>>> sqlContext <- sparkRSQL.init(sc)
>>>
>>> path<-"/Users/devesh/work/airquality/"
>>>
>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>> header="true", inferSchema="true")
>>>
>>> head(dropna(aq,how="any"))
>>>
>>> I am getting the output as
>>>
>>> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5
>>> 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA
>>> 14.9 66 5 6
>>>
>>> The NAs still exist in the output. Am I missing something here?
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>
>
> --
> Warm regards,
> Devesh.
>

Re: NA value handling in sparkR

Posted by Devesh Raj Singh <ra...@gmail.com>.
Hi,

Yes you are right.

I think the problem is with the reading of csv files. read.df is not
treating NAs in the CSV file as missing values.

So what would be a workable solution for dealing with NAs in csv files?



On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <de...@gmail.com>
wrote:

> Hi Devesh,
>
> I'm not certain why that's happening, and it looks like it doesn't happen
> if you use createDataFrame directly:
> aq <- createDataFrame(sqlContext,airquality)
> head(dropna(aq,how="any"))
>
> If I had to guess: dropna(), I believe, drops null values. It's possible
> that createDataFrame converts R's <NA> values to null, so dropna() works
> with that, whereas read.df() does not convert them to null, as those are
> most likely interpreted as strings when they come in from the
> csv. Just a guess, can anyone confirm?
>
> Deb
>
>
>
>
>
>
> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <raj.devesh99@gmail.com
> > wrote:
>
>> Hi,
>>
>> I have applied the following code on airquality dataset available in R ,
>> which has some missing values. I want to omit the rows which has NAs
>>
>> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
>> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>
>> sc <- sparkR.init("local",sparkHome =
>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>
>> sqlContext <- sparkRSQL.init(sc)
>>
>> path<-"/Users/devesh/work/airquality/"
>>
>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>> header="true", inferSchema="true")
>>
>> head(dropna(aq,how="any"))
>>
>> I am getting the output as
>>
>> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2
>> 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9
>> 66 5 6
>>
>> The NAs still exist in the output. Am I missing something here?
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>


-- 
Warm regards,
Devesh.