You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Kevin Mellott <ke...@gmail.com> on 2016/02/29 17:54:57 UTC

Flattening Data within DataFrames

Fellow Sparkers,

I'm trying to "flatten" my view of data within a DataFrame, and am having
difficulties doing so. The DataFrame contains product information, which
includes multiple levels of categories (primary, secondary, etc).

*Example Data (Raw):*
*Name                    Level            Category*
Baked Code            Food             1
Baked Code            Seafood         2
Baked Code            Fish               3
Hockey Stick          Sports            1
Hockey Stick          Hockey          2
Hockey Stick          Equipment      3

*Desired Data:*
*Name                    Category1     Category2     Category3*
Baked Code            Food              Seafood         Fish
Hockey Stick          Sports            Hockey          Equipment

*Approach:*
After parsing the "raw" information into two separate DataFrames
(called *products
*and *categories*) and registering them as a Spark SQL tables, I was
attempting to perform the following query to flatten this all into the
"desired data" (depicted above).

products.registerTempTable("products")
categories.registerTempTable("categories")

val productList = sqlContext.sql(
  " SELECT p.Name, " +
  " c1.Description AS Category1, " +
  " c2.Description AS Category2, " +
  " c3.Description AS Category3 " +
  " FROM products AS p " +
  "   JOIN categories AS c1 " +
  "     ON c1.Name = p.Name AND c1.Level = '1' "
  "   JOIN categories AS c2 " +
  "     ON c2.Name = p.Name AND c2.Level = '2' "
  "   JOIN categories AS c3 " +
  "     ON c3.Name = p.Name AND c3.Level = '3' "

*Issue:*
I get an error when running my query above, because I am not able to JOIN
the *categories* table more than once. Has anybody dealt with this type of
use case before, and if so how did you achieve the desired behavior?

Thank you in advance for your thoughts.

Kevin

Re: Flattening Data within DataFrames

Posted by Kevin Mellott <ke...@gmail.com>.

Thanks Michal - this is exactly what I need.

On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński <
zielinski.michal0@gmail.com> wrote:

> Hi Kevin,
>
> This should help:
>
> https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html
>
> On 29 February 2016 at 16:54, Kevin Mellott <ke...@gmail.com>
> wrote:
>
>> Fellow Sparkers,
>>
>> I'm trying to "flatten" my view of data within a DataFrame, and am having
>> difficulties doing so. The DataFrame contains product information, which
>> includes multiple levels of categories (primary, secondary, etc).
>>
>> *Example Data (Raw):*
>> *Name                    Level            Category*
>> Baked Code            Food             1
>> Baked Code            Seafood         2
>> Baked Code            Fish               3
>> Hockey Stick          Sports            1
>> Hockey Stick          Hockey          2
>> Hockey Stick          Equipment      3
>>
>> *Desired Data:*
>> *Name                    Category1     Category2     Category3*
>> Baked Code            Food              Seafood         Fish
>> Hockey Stick          Sports            Hockey          Equipment
>>
>> *Approach:*
>> After parsing the "raw" information into two separate DataFrames (called *products
>> *and *categories*) and registering them as a Spark SQL tables, I was
>> attempting to perform the following query to flatten this all into the
>> "desired data" (depicted above).
>>
>> products.registerTempTable("products")
>> categories.registerTempTable("categories")
>>
>> val productList = sqlContext.sql(
>>   " SELECT p.Name, " +
>>   " c1.Description AS Category1, " +
>>   " c2.Description AS Category2, " +
>>   " c3.Description AS Category3 " +
>>   " FROM products AS p " +
>>   "   JOIN categories AS c1 " +
>>   "     ON c1.Name = p.Name AND c1.Level = '1' "
>>   "   JOIN categories AS c2 " +
>>   "     ON c2.Name = p.Name AND c2.Level = '2' "
>>   "   JOIN categories AS c3 " +
>>   "     ON c3.Name = p.Name AND c3.Level = '3' "
>>
>> *Issue:*
>> I get an error when running my query above, because I am not able to JOIN
>> the *categories* table more than once. Has anybody dealt with this type
>> of use case before, and if so how did you achieve the desired behavior?
>>
>> Thank you in advance for your thoughts.
>>
>> Kevin
>>
>
>

Re: Flattening Data within DataFrames

Posted by Michał Zieliński <zi...@gmail.com>.

Hi Kevin,

This should help:
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html

On 29 February 2016 at 16:54, Kevin Mellott <ke...@gmail.com>
wrote:

> Fellow Sparkers,
>
> I'm trying to "flatten" my view of data within a DataFrame, and am having
> difficulties doing so. The DataFrame contains product information, which
> includes multiple levels of categories (primary, secondary, etc).
>
> *Example Data (Raw):*
> *Name                    Level            Category*
> Baked Code            Food             1
> Baked Code            Seafood         2
> Baked Code            Fish               3
> Hockey Stick          Sports            1
> Hockey Stick          Hockey          2
> Hockey Stick          Equipment      3
>
> *Desired Data:*
> *Name                    Category1     Category2     Category3*
> Baked Code            Food              Seafood         Fish
> Hockey Stick          Sports            Hockey          Equipment
>
> *Approach:*
> After parsing the "raw" information into two separate DataFrames (called *products
> *and *categories*) and registering them as a Spark SQL tables, I was
> attempting to perform the following query to flatten this all into the
> "desired data" (depicted above).
>
> products.registerTempTable("products")
> categories.registerTempTable("categories")
>
> val productList = sqlContext.sql(
>   " SELECT p.Name, " +
>   " c1.Description AS Category1, " +
>   " c2.Description AS Category2, " +
>   " c3.Description AS Category3 " +
>   " FROM products AS p " +
>   "   JOIN categories AS c1 " +
>   "     ON c1.Name = p.Name AND c1.Level = '1' "
>   "   JOIN categories AS c2 " +
>   "     ON c2.Name = p.Name AND c2.Level = '2' "
>   "   JOIN categories AS c3 " +
>   "     ON c3.Name = p.Name AND c3.Level = '3' "
>
> *Issue:*
> I get an error when running my query above, because I am not able to JOIN
> the *categories* table more than once. Has anybody dealt with this type
> of use case before, and if so how did you achieve the desired behavior?
>
> Thank you in advance for your thoughts.
>
> Kevin
>