Posted to issues@spark.apache.org by "Barry Becker (JIRA)" <ji...@apache.org> on 2016/11/17 16:23:58 UTC
[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

    [ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674129#comment-15674129 ] 

Barry Becker commented on SPARK-12965:
--------------------------------------

This is a big issue for us because we don't control the names of the columns that we get. One ugly workaround might be to convert . to _ in the column names, but then you need to worry about conflicts with other columns that differ only in their use of . or _. Backquoting works in many places, but there are still many places, like this one, where we have seen that it does not work.
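
For what it's worth, here is a rough sketch of how the rename workaround could avoid the conflict problem. It is plain Java with no Spark dependency, and the class and method names (DotColumnRenamer, safeRenames) are made up for illustration, not part of any Spark API. It assumes the input column names are distinct and simply appends a numeric suffix when the dot-free name is already taken.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Spark: plans collision-free renames
// for column names that contain '.'.
public class DotColumnRenamer {

    /**
     * Maps each column name containing '.' to a dot-free replacement that
     * does not clash with any other column name, existing or generated.
     * Columns without a dot are left out of the result.
     * Assumes the input names are distinct.
     */
    public static Map<String, String> safeRenames(List<String> columns) {
        // Every name currently in use, so replacements can avoid them.
        Set<String> taken = new HashSet<>(columns);
        Map<String, String> renames = new LinkedHashMap<>();
        for (String name : columns) {
            if (!name.contains(".")) {
                continue;
            }
            String base = name.replace('.', '_');
            String candidate = base;
            int suffix = 1;
            // Append a numeric suffix until the candidate is unused.
            while (taken.contains(candidate)) {
                candidate = base + "_" + suffix++;
            }
            taken.add(candidate);
            renames.put(name, candidate);
        }
        return renames;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("a.b", "a_b", "c.d");
        // prints {a.b=a_b_1, c.d=c_d}
        System.out.println(safeRenames(columns));
    }
}
```

The resulting map could then be applied one entry at a time with df.withColumnRenamed(oldName, newName) before building the pipeline. Whether the old name needs backquotes in that call is something to check on your Spark version.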

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> -----------------------------------------------------------------------
>
>                 Key: SPARK-12965
>                 URL: https://issues.apache.org/jira/browse/SPARK-12965
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Joshua Taylor
>         Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way that other methods do.  E.g., given a DataFrame df, {{df.col("`a.b`")}} will return a column.  On a StringIndexer indexer, {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents as a_bIndex.
> {code}
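> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.ml.feature.StringIndexer;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SQLContext;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> public class SparkMLDotColumn {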
> 	public static void main(String[] args) {
> 		// Get the contexts
> 		SparkConf conf = new SparkConf()
> 				.setMaster("local[*]")
> 				.setAppName("test")
> 				.set("spark.ui.enabled", "false");
> 		JavaSparkContext sparkContext = new JavaSparkContext(conf);
> 		SQLContext sqlContext = new SQLContext(sparkContext);
> 		
> 		// Create a schema with a single string column named "a.b"
> 		StructType schema = new StructType(new StructField[] {
> 				DataTypes.createStructField("a.b", DataTypes.StringType, false)
> 		});
> 		// Create an RDD with two rows and a DataFrame from it
> 		List<Row> rows = Arrays.asList(RowFactory.create("foo"), RowFactory.create("bar")); 
> 		JavaRDD<Row> rdd = sparkContext.parallelize(rows);
> 		DataFrame df = sqlContext.createDataFrame(rdd, schema);
> 		
> 		df = df.withColumn("a_b", df.col("`a.b`"));
> 		
> 		StringIndexer indexer0 = new StringIndexer();
> 		indexer0.setInputCol("a_b");
> 		indexer0.setOutputCol("a_bIndex");
> 		df = indexer0.fit(df).transform(df);
> 		
> 		StringIndexer indexer1 = new StringIndexer();
> 		indexer1.setInputCol("`a.b`");
> 		indexer1.setOutputCol("abIndex");
> 		df = indexer1.fit(df).transform(df);
> 		
> 		df.show();
> 	}
> }
> {code}


