Posted to user@spark.apache.org by Joshua TAYLOR <jo...@gmail.com> on 2016/01/22 23:57:24 UTC
Trouble dropping columns from a DataFrame that has other columns with dots in their names
I've been having lots of trouble today with DataFrames whose columns have
dots in their names. I know that in many places, backticks can be used to
quote column names, but the problem I'm running into now is that I can't
drop a column that has *no* dots in its name when there are *other* columns
in the table that do. Here's some code that tries four ways of dropping
the column a_b. One throws a weird exception, one is a semi-expected no-op,
and the other two work.
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkExample {
    public static void main(String[] args) {
        /* Get the spark and sql contexts. Setting spark.ui.enabled to false
         * keeps Spark from using its built in dependency on Jersey. */
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("test")
                .set("spark.ui.enabled", "false");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        /* Create a schema with two columns, one of which has no dots (a_b),
         * and the other which does (a.c). */
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a_b", DataTypes.StringType, false),
                DataTypes.createStructField("a.c", DataTypes.IntegerType, false)
        });

        /* Create an RDD of Rows, and then convert it into a DataFrame. */
        List<Row> rows = Arrays.asList(
                RowFactory.create("t", 2),
                RowFactory.create("u", 4));
        JavaRDD<Row> rdd = sparkContext.parallelize(rows);
        DataFrame df = sqlContext.createDataFrame(rdd, schema);

        /* Four ways to attempt dropping a_b from the DataFrame.
         * We'll try calling each one of these and looking at
         * the results (or the resulting exception). */
        Function<DataFrame,DataFrame> x1 = d -> d.drop("a_b");            // exception
        Function<DataFrame,DataFrame> x2 = d -> d.drop("`a_b`");          // no-op
        Function<DataFrame,DataFrame> x3 = d -> d.drop(d.col("a_b"));     // works
        Function<DataFrame,DataFrame> x4 = d -> d.drop(d.col("`a_b`"));   // works

        /* Cases are printed as 0 through 3, in the order x1 through x4. */
        int i = 0;
        for (Function<DataFrame,DataFrame> x : Arrays.asList(x1, x2, x3, x4)) {
            System.out.println("Case " + i++);
            try {
                x.apply(df).show();
            } catch (Exception e) {
                e.printStackTrace(System.out);
            }
        }
    }
}
Here's the output (cases are numbered from 0). Case 1 is a no-op, which I
think I can understand, because DataFrame.drop(String) doesn't do any
resolution (it doesn't need to), so d.drop("`a_b`") doesn't do anything
because there's no column whose name is literally "`a_b`". Cases 2 and 3
work, because DataFrame.col() does do resolution, and both "a_b" and "`a_b`"
resolve correctly. But why does Case 0 fail? And why with the message
that it does? Why is it trying to resolve "a.c" at all in this case?
Case 0
org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input columns a_b, a.c;
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
    at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
    at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
    at SparkExample.lambda$0(SparkExample.java:45)
    at SparkExample.main(SparkExample.java:54)
Case 1
+---+---+
|a_b|a.c|
+---+---+
|  t|  2|
|  u|  4|
+---+---+
Case 2
+---+
|a.c|
+---+
|  2|
|  4|
+---+
Case 3
+---+
|a.c|
+---+
|  2|
|  4|
+---+
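A possible stopgap, offered as a sketch rather than a confirmed fix: since the trace for Case 0 shows DataFrame.drop ending up in DataFrame.select, one could skip drop(String) and select the remaining columns explicitly, with each name wrapped in backticks. The helper below is plain Java with no Spark dependency. The class name DropWorkaround and the method names quote and remainingQuoted are made up for illustration, and it assumes that column names contain no backticks themselves.

```java
import java.util.ArrayList;
import java.util.List;

class DropWorkaround {
    /* Wrap a column name in backticks so that a dot in it is read as part
     * of the name. Assumption: the name itself contains no backticks. */
    public static String quote(String name) {
        return "`" + name + "`";
    }

    /* Return every column name except toDrop, each one backtick-quoted.
     * The comparison is an exact, case-sensitive match on the raw name. */
    public static List<String> remainingQuoted(String[] columns, String toDrop) {
        List<String> kept = new ArrayList<>();
        for (String column : columns) {
            if (!column.equals(toDrop)) {
                kept.add(quote(column));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        /* Same column names as the schema in the example above. */
        String[] columns = {"a_b", "a.c"};
        System.out.println(remainingQuoted(columns, "a_b")); // prints [`a.c`]
    }
}
```

With a real DataFrame, the input could come from df.columns() and the quoted names could presumably be passed to df.selectExpr(...). I have not verified that against this Spark version.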
Thanks in advance,
Joshua
--
Joshua Taylor, http://www.cs.rpi.edu/~tayloj/