You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Lior Baber <li...@gettaxi.com> on 2016/01/26 14:51:05 UTC

py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statetment

'm trying to perform a simple task in spark dataframe (python) which is
create new dataframe by selecting specific column and nested columns from
another dataframe
for example :


df.printSchema()
root
 |-- time_stamp: long (nullable = true)
 |-- country: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- order: struct (nullable = true)
 |    |-- created_at: string (nullable = true)
 |    |-- creation_type: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |-- destination: struct (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- ordering_user: struct (nullable = true)
 |    |    |-- cancellation_score: long (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- is_test: boolean (nullable = true)

    df2=df.sqlContext.sql("""select a.country_code as country_code,
a.order_destination_state as order_destination_state,
a.order_ordering_user_id as order_ordering_user_id,
a.order_ordering_user_is_test as order_ordering_user_is_test,
a.time_stamp as time_stamp
from
(select
flat_order_creation.order.destination.state as order_destination_state,
flat_order_creation.order.ordering_user.id as order_ordering_user_id,
flat_order_creation.order.ordering_user.is_test as
order_ordering_user_is_test,
flat_order_creation.time_stamp as time_stamp
from flat_order_creation) a""")


and I get the following error:

Traceback (most recent call last):
  File "/home/hadoop/scripts/orders_all.py", line 180, in <module>
    df2=sqlContext.sql(q)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line
552, in sql
  File
"/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line
538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
36, in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.sql.
: java.lang.RuntimeException: [6.21] failure: ``*'' expected but `order'
found

flat_order_creation.order.destination.state as order_destination_state,



I'm using spark-submit with master in local mode to run the this code.
it important to mention the when I'm connecting to pyspark shell and run
the code (line by line) it works , but when submitting it (even in local
mode) it fails.
another thing is important to mention is that when selecting a non nested
field it works as well.
I'm using spark 1.5.2 on EMR (version 4.2.0)