Posted to user@spark.apache.org by Richard Cobbe <ri...@oracle.com> on 2016/02/10 22:48:13 UTC

legal column names

I'm working with Spark 1.5.0, and I'm using the Scala API to construct
DataFrames and perform operations on them.  My application requires that I
synthesize column names for intermediate results under some circumstances,
and I don't know what the rules are for legal column names.  In particular,
I'm running into some interesting behavior involving the ability (or lack
thereof) to resolve column references.  Is there documentation anywhere
that describes which column names are considered "safe"?

To see what I mean by "safe", consider the following examples:

Let df be a DataFrame with schema [id: bigint].

    scala> val df = ...   // Details don't matter
    df: org.apache.spark.sql.DataFrame = [id: bigint]

    scala> df.select($"id".as("x")).select($"x")
    res32: org.apache.spark.sql.DataFrame = [x: bigint]

Great; that works just as I'd expect it to.  Column resolution doesn't
seem to be case-sensitive, though:

    scala> df.select($"id", $"id".as("ID")).select($"id")
    org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id#0L, id#163L.;
    ... and a big stack trace ...
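(As an aside, this looks related to the spark.sql.caseSensitive setting; if
I'm reading the config docs right, turning it on makes the two columns
distinct, though I'd rather not rely on a global flag for this:)

```scala
// Assuming sqlContext is the usual SQLContext available in the shell.
// With case-sensitive resolution enabled, "id" and "ID" should no
// longer collide when resolving the reference below.
sqlContext.setConf("spark.sql.caseSensitive", "true")
df.select($"id", $"id".as("ID")).select($"id")
```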

Ok, make sure we don't create DataFrames with two columns whose names
differ only by case; fair enough.  Certain characters in column names also
cause problems:

    scala> df.select($"id".as("a.b"))
    res34: org.apache.spark.sql.DataFrame = [a.b: bigint]

Good; but can we use the column?

    scala> df.select($"id".as("a.b")).select($"a.b")
    org.apache.spark.sql.AnalysisException: cannot resolve 'a.b' given input columns a.b;
    ... and another big stack trace ...

Apparently not.  Ok, I think I remember reading somewhere that Spark SQL
limits column names to alphanumerics and underscores; does that apply here
too?

    scala> df.select($"id".as("x%y")).select($"x%y")
    res35: org.apache.spark.sql.DataFrame = [x%y: bigint]

Apparently not; % is legal too.  (I've done a variety of experiments, not
repeated here, that suggest that alphanumerics + underscore are safe.
Oddly enough, so are internal spaces.)
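One partial workaround I've stumbled on for the dotted case: backtick-quoting
the name when resolving it seems to make the column usable, presumably because
the backticks stop the dot from being parsed as a struct-field path.  I
haven't found this documented for DataFrames either, though:

```scala
// Backticks appear to force the whole string to be treated as one
// column name rather than a nested-field reference.
df.select($"id".as("a.b")).select($"`a.b`")
```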

Is there a specification for legal column names that won't cause resolution
problems?  I've looked through the Scala API docs for DataFrame, Column,
and ColumnName without finding any.
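In the meantime, for the synthesized intermediate names I'm falling back on
the conservative subset my experiments suggest is safe.  A sketch of what I
mean (the helper and its character class are my own guess, not anything
documented):

```scala
// Hypothetical helper: map an arbitrary synthesized name onto the
// conservative alphanumeric-plus-underscore subset, replacing every
// other character with '_' and prefixing names that would otherwise
// start with a digit.
def safeColumnName(raw: String): String = {
  val cleaned = raw.map(c => if (c.isLetterOrDigit || c == '_') c else '_')
  if (cleaned.headOption.exists(_.isDigit)) "c_" + cleaned else cleaned
}
```

For example, safeColumnName("a.b") yields "a_b", which resolves without any
quoting tricks.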

Thanks,

Richard

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org