You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Kevin Mellott <ke...@gmail.com> on 2016/02/29 17:54:57 UTC
Flattening Data within DataFrames
Fellow Sparkers,
I'm trying to "flatten" my view of data within a DataFrame, and am having
difficulties doing so. The DataFrame contains product information, which
includes multiple levels of categories (primary, secondary, etc).
*Example Data (Raw):*
*Name Level Category*
Baked Code Food 1
Baked Code Seafood 2
Baked Code Fish 3
Hockey Stick Sports 1
Hockey Stick Hockey 2
Hockey Stick Equipment 3
*Desired Data:*
*Name Category1 Category2 Category3*
Baked Code Food Seafood Fish
Hockey Stick Sports Hockey Equipment
*Approach:*
After parsing the "raw" information into two separate DataFrames
(called *products
*and *categories*) and registering them as a Spark SQL tables, I was
attempting to perform the following query to flatten this all into the
"desired data" (depicted above).
products.registerTempTable("products")
categories.registerTempTable("categories")
val productList = sqlContext.sql(
" SELECT p.Name, " +
" c1.Description AS Category1, " +
" c2.Description AS Category2, " +
" c3.Description AS Category3 " +
" FROM products AS p " +
" JOIN categories AS c1 " +
" ON c1.Name = p.Name AND c1.Level = '1' "
" JOIN categories AS c2 " +
" ON c2.Name = p.Name AND c2.Level = '2' "
" JOIN categories AS c3 " +
" ON c3.Name = p.Name AND c3.Level = '3' "
*Issue:*
I get an error when running my query above, because I am not able to JOIN
the *categories* table more than once. Has anybody dealt with this type of
use case before, and if so how did you achieve the desired behavior?
Thank you in advance for your thoughts.
Kevin
Re: Flattening Data within DataFrames
Posted by Kevin Mellott <ke...@gmail.com>.
Thanks Michal - this is exactly what I need.
On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński <
zielinski.michal0@gmail.com> wrote:
> Hi Kevin,
>
> This should help:
>
> https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html
>
> On 29 February 2016 at 16:54, Kevin Mellott <ke...@gmail.com>
> wrote:
>
>> Fellow Sparkers,
>>
>> I'm trying to "flatten" my view of data within a DataFrame, and am having
>> difficulties doing so. The DataFrame contains product information, which
>> includes multiple levels of categories (primary, secondary, etc).
>>
>> *Example Data (Raw):*
>> *Name Level Category*
>> Baked Code Food 1
>> Baked Code Seafood 2
>> Baked Code Fish 3
>> Hockey Stick Sports 1
>> Hockey Stick Hockey 2
>> Hockey Stick Equipment 3
>>
>> *Desired Data:*
>> *Name Category1 Category2 Category3*
>> Baked Code Food Seafood Fish
>> Hockey Stick Sports Hockey Equipment
>>
>> *Approach:*
>> After parsing the "raw" information into two separate DataFrames (called *products
>> *and *categories*) and registering them as a Spark SQL tables, I was
>> attempting to perform the following query to flatten this all into the
>> "desired data" (depicted above).
>>
>> products.registerTempTable("products")
>> categories.registerTempTable("categories")
>>
>> val productList = sqlContext.sql(
>> " SELECT p.Name, " +
>> " c1.Description AS Category1, " +
>> " c2.Description AS Category2, " +
>> " c3.Description AS Category3 " +
>> " FROM products AS p " +
>> " JOIN categories AS c1 " +
>> " ON c1.Name = p.Name AND c1.Level = '1' "
>> " JOIN categories AS c2 " +
>> " ON c2.Name = p.Name AND c2.Level = '2' "
>> " JOIN categories AS c3 " +
>> " ON c3.Name = p.Name AND c3.Level = '3' "
>>
>> *Issue:*
>> I get an error when running my query above, because I am not able to JOIN
>> the *categories* table more than once. Has anybody dealt with this type
>> of use case before, and if so how did you achieve the desired behavior?
>>
>> Thank you in advance for your thoughts.
>>
>> Kevin
>>
>
>
Re: Flattening Data within DataFrames
Posted by Michał Zieliński <zi...@gmail.com>.
Hi Kevin,
This should help:
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html
On 29 February 2016 at 16:54, Kevin Mellott <ke...@gmail.com>
wrote:
> Fellow Sparkers,
>
> I'm trying to "flatten" my view of data within a DataFrame, and am having
> difficulties doing so. The DataFrame contains product information, which
> includes multiple levels of categories (primary, secondary, etc).
>
> *Example Data (Raw):*
> *Name Level Category*
> Baked Code Food 1
> Baked Code Seafood 2
> Baked Code Fish 3
> Hockey Stick Sports 1
> Hockey Stick Hockey 2
> Hockey Stick Equipment 3
>
> *Desired Data:*
> *Name Category1 Category2 Category3*
> Baked Code Food Seafood Fish
> Hockey Stick Sports Hockey Equipment
>
> *Approach:*
> After parsing the "raw" information into two separate DataFrames (called *products
> *and *categories*) and registering them as a Spark SQL tables, I was
> attempting to perform the following query to flatten this all into the
> "desired data" (depicted above).
>
> products.registerTempTable("products")
> categories.registerTempTable("categories")
>
> val productList = sqlContext.sql(
> " SELECT p.Name, " +
> " c1.Description AS Category1, " +
> " c2.Description AS Category2, " +
> " c3.Description AS Category3 " +
> " FROM products AS p " +
> " JOIN categories AS c1 " +
> " ON c1.Name = p.Name AND c1.Level = '1' "
> " JOIN categories AS c2 " +
> " ON c2.Name = p.Name AND c2.Level = '2' "
> " JOIN categories AS c3 " +
> " ON c3.Name = p.Name AND c3.Level = '3' "
>
> *Issue:*
> I get an error when running my query above, because I am not able to JOIN
> the *categories* table more than once. Has anybody dealt with this type
> of use case before, and if so how did you achieve the desired behavior?
>
> Thank you in advance for your thoughts.
>
> Kevin
>