You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Ashish Shrowty (JIRA)" <ji...@apache.org> on 2016/10/01 18:01:20 UTC

[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

    [ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15538927#comment-15538927 ] 

Ashish Shrowty edited comment on SPARK-17709 at 10/1/16 6:00 PM:
-----------------------------------------------------------------

[~dkbiswal] I just went through manual steps of creating the table in Hive (using EMR 5.0.0), inserting data into it, and then querying using spark .. and got the exception .. steps I followed - 
Step 1 - 
hive> create external table referencedata.testproduct (
    > productid int,
    > name string,
    > price double,
    > itemcount int
    > ) PARTITIONED BY (companyid int)
    > STORED AS PARQUET
    > LOCATION 's3://com.birdzi.datalake.test/testtable'
    > ;
Step 2 - Insert data -
set hive.exec.dynamic.partition.mode=nonstrict
insert into referencedata.testproduct partition(companyid) values(1,"p1",10.0,10,100);
insert into referencedata.testproduct partition(companyid) values(2,"p1",12.0,12,100);
insert into referencedata.testproduct partition(companyid) values(3,"p3",13.0,12,101);

Step 3 - query using spark-shell -
val d1 = spark.sql("select * from referencedata.testproduct")
val df1 = d1.sample(false,0.5).select("companyid","productid","price").groupBy("companyid","productid").agg(avg("price").as("avgprice"))
val df2 = d1.sample(false,0.5).select("companyid","productid","itemcount").groupBy("companyid","productid").agg(avg("itemcount").as("avgitemcount"))
df1.join(df2, Seq("companyid","loyaltycardnumber")) .. throws exception -
org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [companyid, productid, price, avgitemcount] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 49 elided



was (Author: ashrowty):
[~dkbiswal] I just went through manual steps of creating the table in Hive (using EMR 5.0.0), inserting data into it, and then querying using spark .. and got the exception .. steps I followed - 
Step 1 - 
hive> create external table referencedata.testproduct (
hive> create external table referencedata.testproduct (
    > productid int,
    > name string,
    > price double,
    > itemcount int
    > ) PARTITIONED BY (companyid int)
    > STORED AS PARQUET
    > LOCATION 's3://com.birdzi.datalake.test/testtable'
    > ;
Step 2 - Insert data -
set hive.exec.dynamic.partition.mode=nonstrict
insert into referencedata.testproduct partition(companyid) values(1,"p1",10.0,10,100);
insert into referencedata.testproduct partition(companyid) values(2,"p1",12.0,12,100);
insert into referencedata.testproduct partition(companyid) values(3,"p3",13.0,12,101);

Step 3 - query using spark-shell -
val d1 = spark.sql("select * from referencedata.testproduct")
val df1 = d1.sample(false,0.5).select("companyid","productid","price").groupBy("companyid","productid").agg(avg("price").as("avgprice"))
val df2 = d1.sample(false,0.5).select("companyid","productid","itemcount").groupBy("companyid","productid").agg(avg("itemcount").as("avgitemcount"))
df1.join(df2, Seq("companyid","loyaltycardnumber")) .. throws exception -
org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [companyid, productid, price, avgitemcount] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 49 elided


> spark 2.0 join - column resolution error
> ----------------------------------------
>
>                 Key: SPARK-17709
>                 URL: https://issues.apache.org/jira/browse/SPARK-17709
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ashish Shrowty
>              Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from <hivetable>")  
> val df1 = d1.groupBy("key1","key2")
>           .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>           .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org