Posted to user@spark.apache.org by Devesh Raj Singh <ra...@gmail.com> on 2016/02/05 07:44:28 UTC
different behavior while using createDataFrame and read.df in SparkR
Hi,
I am using Spark 1.5.1.
When I do this:
df <- createDataFrame(sqlContext, iris)
# create a new indicator column for the category "setosa"
df$Species1<-ifelse((df)[[5]]=="setosa",1,0)
head(df)
the new column is created. Output:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
But when I save the iris dataset as a CSV file, read it back, and convert it to a SparkR DataFrame:
df <- read.df(sqlContext, "/Users/devesh/Github/deveshgit2/bdaml/data/iris/",
              source = "com.databricks.spark.csv", header = "true",
              inferSchema = "true")
and then try to create the new column:
df$Species1<-ifelse((df)[[5]]=="setosa",1,0)
I get the following error:
16/02/05 12:11:01 ERROR RBackendHandler: col on 922 failed
Error in select(x, x$"*", alias(col, colName)) :
  error in evaluating the argument 'col' in selecting a method for function 'select': Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species);
	at org.apache.spark.s
--
Warm regards,
Devesh.
RE: different behavior while using createDataFrame and read.df in SparkR
Posted by "Sun, Rui" <ru...@intel.com>.
I guess the problem is in these two lines:
dummy.df<-withColumn(dataframe,paste0(colnames(cat.column),j),ifelse(column[[1]]==levels(as.factor(unlist(cat.column)))[j],1,0) )
dataframe<-dummy.df
Because dataframe is reassigned to a new DataFrame in each iteration, the column variable also has to be reassigned so that it references a column of that new DataFrame.
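The suggestion above could be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes the function is changed to take the column name as a string (the hypothetical parameter colName) instead of a pre-selected DataFrame, that the name contains no "." character, and that the column has few enough distinct values to collect to the driver. It has not been run against Spark 1.5.1.

```r
# Sketch only: a variant of BDADummies that looks the column up again on
# every iteration. colName is a hypothetical parameter, not in the original.
BDADummies <- function(dataframe, colName) {
  # Collect the distinct category values to the driver
  # (assumes the number of categories is small).
  values <- collect(distinct(select(dataframe, colName)))
  lev <- sort(unique(as.character(unlist(values, use.names = FALSE))))

  for (j in seq_along(lev)) {
    # Resolve the column against the *current* DataFrame, so the Column
    # object never refers to a DataFrame from an earlier iteration.
    current <- dataframe[[colName]]
    dataframe <- withColumn(dataframe, paste0(colName, j),
                            ifelse(current == lev[j], 1, 0))
  }
  dataframe
}

# Usage, assuming df1 is a SparkR DataFrame with a Species column:
# newdummy.df <- BDADummies(df1, "Species")
```

Passing the name rather than a Column or a selected DataFrame keeps the lookup inside the loop, which is the point of the advice above. Whether this also removes the "argument is not interpretable as logical" error is not confirmed in the thread.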
From: Devesh Raj Singh [mailto:raj.devesh99@gmail.com]
Sent: Saturday, February 6, 2016 8:31 PM
To: Sun, Rui <ru...@intel.com>
Cc: user@spark.apache.org
Subject: Re: different behavior while using createDataFrame and read.df in SparkR
Re: different behavior while using createDataFrame and read.df in SparkR
Posted by Devesh Raj Singh <ra...@gmail.com>.
Thank you, Rui Sun, for the observation! It helped.
I have a new problem. I wrote a small function that creates dummy variables for a categorical column:
BDADummies <- function(dataframe, column) {
  cat.column <- vector(mode = "character", length = nrow(dataframe))
  cat.column <- collect(column)
  lev <- length(levels(as.factor(unlist(cat.column))))
  for (j in 1:lev) {
    dummy.df <- withColumn(dataframe, paste0(colnames(cat.column), j),
                           ifelse(column[[1]] == levels(as.factor(unlist(cat.column)))[j], 1, 0))
    dataframe <- dummy.df
  }
  return(dataframe)
}
When I call the function using
newdummy.df<-BDADummies(df1,column=select(df1,df1$Species))
I get the following error:
Error in withColumn(dataframe, paste0(colnames(cat.column), j), ifelse(column[[1]] == :
  error in evaluating the argument 'col' in selecting a method for function 'withColumn': Error in if (le > 0) paste0("[1:", paste(le), "]") else "(0)" :
  argument is not interpretable as logical
But when I run the statement directly, without wrapping it in a function,
dummy.df<-withColumn(dataframe,paste0(colnames(cat.column),j),ifelse(column[[1]]==levels(as.factor(unlist(cat.column)))[j],1,0))
it creates the new columns with the column names I want.
Warm regards,
Devesh.
On Sat, Feb 6, 2016 at 7:09 AM, Sun, Rui <ru...@intel.com> wrote:
> I guess this is related to
> https://issues.apache.org/jira/browse/SPARK-11976
>
> When calling createDataFrame on iris, the “.” character in column names
> will be replaced with “_”.
>
> It seems that when you create a DataFrame from the CSV file, the “.”
> characters in column names are still there.
RE: different behavior while using createDataFrame and read.df in SparkR
Posted by "Sun, Rui" <ru...@intel.com>.
I guess this is related to https://issues.apache.org/jira/browse/SPARK-11976
When calling createDataFrame on iris, the “.” character in column names is replaced with “_”.
It seems that when you create a DataFrame from the CSV file, the “.” characters in the column names are still there.
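If the dots in the header are indeed what breaks column resolution, one possible workaround (not proposed in the thread, so treat it as an assumption) is to rename the columns in plain R before writing the CSV, so that the file Spark reads never contains "." in a column name. The sketch below uses only base R; the output path is a hypothetical temporary location.

```r
# Work on a copy so the built-in iris data set is left untouched.
iris_clean <- iris

# Replace every "." in the column names with "_".
# fixed = TRUE treats "." as a literal dot, not a regex wildcard.
names(iris_clean) <- gsub(".", "_", names(iris_clean), fixed = TRUE)

# Hypothetical output path; row.names = FALSE keeps R's row index
# out of the file so the header lines up with the data columns.
out_path <- file.path(tempdir(), "iris_clean.csv")
write.csv(iris_clean, out_path, row.names = FALSE)

print(names(iris_clean))
```

The resulting file can then be loaded with read.df as in the original message, with the path adjusted to wherever the file was written.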