Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/06/20 02:09:00 UTC

[jira] [Resolved] (SPARK-24595) What about additional support on deeply nested column?

     [ https://issues.apache.org/jira/browse/SPARK-24595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24595.
----------------------------------
    Resolution: Invalid

Let's redirect the question to the dev/user mailing list before filing it here as a JIRA.

> What about additional support on deeply nested column?
> ------------------------------------------------------
>
>                 Key: SPARK-24595
>                 URL: https://issues.apache.org/jira/browse/SPARK-24595
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Zejun Li
>            Priority: Major
>
> I store some trajectory data in Parquet with this schema:
> {code:java}
> create table traj(
>   id     string,
>   points array<struct<
>             lat:   double,
>             lng:   double,
>             time:  bigint,
>             speed: double,
>             -- ... lots of attributes here ...
>             candidate_road: array<struct<linestring: string, score: double>>
>           >>
> ){code}
> It contains a lot of attributes that come from sensors. It also has a nested array which contains information generated by the map-matching algorithm.
>  
> All of my algorithms run on this dataset in a trajectory-oriented way, which means they often iterate over points and use a subset of each point's attributes for computation. With this schema I can get the points of a trajectory without doing a `group by` and `collect_list`.
>  
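> To illustrate, here is a minimal Scala sketch, assuming spark-shell (so a SparkSession named `spark` is in scope) and a hypothetical path `/data/traj`: with the nested schema the points of a trajectory arrive as a single array column, while a flattened layout would need `explode`, `group by`, and `collect_list` to rebuild the same shape.
> {code:java}
> // Minimal sketch; /data/traj is a hypothetical path for illustration.
> import spark.implicits._
> import org.apache.spark.sql.functions.{explode, collect_list}
> 
> val traj = spark.read.parquet("/data/traj")
> 
> // Nested layout: each row already carries its full point array.
> val nested = traj.select($"id", $"points")
> 
> // A flattened layout would force an explode and a regroup to get the same shape:
> val flat = traj.select($"id", explode($"points").as("point"))
> val regrouped = flat.groupBy($"id").agg(collect_list($"point").as("points"))
> {code}
>  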
> Because Parquet works very well on deeply nested data, I store it directly in Parquet format without flattening.
> It works very well with Impala, because Impala has special support for nested data:
>  
> {code:java}
> select
>   id,
>   avg_speed
> from 
>   traj t,
>   (select avg(speed) avg_speed from t.points where time < '2018-06-19'){code}
> As you can see, Impala treats an array of structs as a nested table and can do computation on array elements at a per-row level. Impala will also use Parquet's features to prune unused attributes in the point struct.
>  
> I use Spark for some complex algorithms which cannot be written in pure SQL, but I have met some trouble with the Spark DataFrame API:
> Spark cannot do schema pruning and filter push-down on nested columns, and it seems there is no handy syntax for playing with deeply nested data.
>  * `explode` does not help in my scenario, because I need to preserve the trajectory-points hierarchy. If I use `explode` here, I need an extra `group by` on `id` (see the sketch after this list).
>  * Although I can directly select `points.lat`, it loses its structure. If I need an array of (lat, lng) pairs, I have to zip the two arrays. And this does not work at deeper nesting levels, such as selecting `points.candidate_road.score`.
>  * Maybe I can use the parquet-mr package to read the files as an RDD and pass the read schema directly to it, but this approach loses Hive integration and Spark's vectorized reader.
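>  
> For example, reproducing the Impala query above with the `explode` workaround looks roughly like this (a sketch only; the epoch-seconds literal for 2018-06-19 is an assumption about how `time` is encoded):
> {code:java}
> import spark.implicits._
> import org.apache.spark.sql.functions.{explode, avg, lit}
> 
> val traj = spark.read.parquet("/data/traj")  // hypothetical path, as above
> 
> val perPoint = traj
>   .select($"id", explode($"points").as("point"))
>   .where($"point.time" < lit(1529366400L))  // assumed encoding of '2018-06-19'
> 
> // The extra group-by on id that the nested layout was meant to avoid:
> val avgSpeed = perPoint
>   .groupBy($"id")
>   .agg(avg($"point.speed").as("avg_speed"))
> {code}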
>  
> So I think it would be nice to have an Impala-style subquery syntax on complex data, or to add some support for schema projection on nested data, like:
>  
> {code:java}
> select id, extract(points, lat, lng, extract(candidate_road, score)) from traj{code}
> which would produce a schema like:
> {code:java}
> |- id string
> |- points array of struct
>     |- lat double
>     |- lng double
>     |- candidate_road array of struct
>         |- score double{code}
> And then the user can play with `points` in the desired schema, with data pruning done by Parquet.
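>  
> For reference, a rough approximation of this projection can be written by hand with the built-in higher-order functions that Spark 2.4 adds (`transform` with `named_struct`), though it must be spelled out field by field and does not by itself guarantee Parquet-level pruning. A sketch, again assuming the `traj` DataFrame from above:
> {code:java}
> import spark.implicits._
> import org.apache.spark.sql.functions.expr
> 
> // Rebuild the pruned nested structure element by element (Spark 2.4+ syntax).
> val pruned = traj.select(
>   $"id",
>   expr("""
>     transform(points, p -> named_struct(
>       'lat', p.lat,
>       'lng', p.lng,
>       'candidate_road',
>         transform(p.candidate_road, r -> named_struct('score', r.score))
>     ))
>   """).as("points")
> )
> {code}
> The resulting schema of `pruned` matches the one sketched above.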
>  
>  
> Or is there some existing syntax that already does what I need?
>  


