Posted to user@spark.apache.org by Richard Siebeling <rs...@gmail.com> on 2016/01/19 16:17:51 UTC

Split columns in RDD

Hi,

What is the most efficient way to split a column into several columns and
find out how many columns were created?

Here is the current RDD:
-----------------
ID    STATE
-----------------
1     TX, NY, FL
2     CA, OH
-----------------

This is the preferred output:
-----------------------------------
ID    STATE_1    STATE_2    STATE_3
-----------------------------------
1     TX         NY         FL
2     CA         OH
-----------------------------------

together with a separate list of the new columns: STATE_1, STATE_2, STATE_3


It looks like the following output is feasible using a reduceBy operator:
----------------------------------------------------------------
ID    STATE_1    STATE_2    STATE_3    NEW_COLUMNS
----------------------------------------------------------------
1     TX         NY         FL         STATE_1, STATE_2, STATE_3
2     CA         OH                    STATE_1, STATE_2
----------------------------------------------------------------

Then, in the reduce step, the distinct new columns can be calculated.
Is it possible to get the second output, where the new_columns are saved
somewhere next to the RDD?
Or is it required to use the second approach?

thanks in advance,
Richard

Re: Split columns in RDD

Posted by Richard Siebeling <rs...@gmail.com>.
thanks Daniel, this will certainly help,
regards, Richard


Re: Split columns in RDD

Posted by Daniel Imberman <da...@gmail.com>.
edit 2: filter should be map

val numColumns = separatedInputStrings
    .map { case (id, (stateList, numStates)) => numStates }
    .reduce(math.max)
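
Putting the corrected map/reduce together with the mapValues step from the first reply, the whole pipeline might look like the sketch below. It is only an illustration under assumptions: the object name SplitStateColumns, the helpers splitStates and toFixedWidth, the sample data, and the choice of an empty string for missing cells are all made up here and are not from the thread. Note also that reduce fails on an empty RDD, so a real job would need to guard against that.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SplitStateColumns {
  // Hypothetical helper: "TX, NY, FL" -> Array("TX", "NY", "FL")
  def splitStates(raw: String): Array[String] =
    raw.split(",").map(_.trim).filter(_.nonEmpty)

  // Hypothetical helper: pad a row with empty strings up to the table width
  def toFixedWidth(states: Array[String], width: Int): Seq[String] =
    states.toSeq.padTo(width, "")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("split-columns").setMaster("local[*]"))
    try {
      val input = sc.parallelize(Seq((1, "TX, NY, FL"), (2, "CA, OH")))

      // Split every STATE cell once and keep the result, since it is used twice.
      val split = input.mapValues(splitStates).cache()

      // Project each row to its number of states, then take the maximum.
      val numColumns = split
        .map { case (_, states) => states.length }
        .reduce((a, b) => math.max(a, b))

      // Build the header and the fixed-width rows of the preferred output.
      val header = "ID" +: (1 to numColumns).map(i => s"STATE_$i")
      val rows = split.mapValues(toFixedWidth(_, numColumns)).collect()

      println(header.mkString("\t"))
      rows.foreach { case (id, cols) => println((id.toString +: cols).mkString("\t")) }
    } finally {
      sc.stop()
    }
  }
}
```

The count is computed over all rows, so a wider row anywhere in the RDD is reflected in the header, at the cost of one extra pass over the data.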


Re: Split columns in RDD

Posted by Daniel Imberman <da...@gmail.com>.
edit: Mistake in the second code example

val numColumns = separatedInputStrings.filter{ case(id, (stateList,
numStates)) => numStates}.reduce(math.max)



Re: Split columns in RDD

Posted by Daniel Imberman <da...@gmail.com>.
Hi Richard,

If I understand the question correctly it sounds like you could probably do
this using mapValues (I'm assuming that you want two pieces of information
out of all rows, the states as individual items, and the number of states
in the row)


// input: RDD[(Int, String)], e.g. (1, "TX, NY, FL")
val separatedInputStrings = input.mapValues { inputString =>
    // split on commas and trim the whitespace around each state
    val stringList = inputString.split(",").map(_.trim)
    (stringList, stringList.size)
}

If you then wanted to find out how many state columns you should have in
your table you could use a normal reduce (with a filter beforehand to
reduce how much data you are shuffling)

val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)

I hope this helps!




Re: Split columns in RDD

Posted by Richard Siebeling <rs...@gmail.com>.
That's true, and that's the way we're doing it now, but then we're only using
the first row to determine the number of split columns.
It could be that the second (or last) row has 10 new columns, and
we'd like to know that too.

A reduceBy operator can probably be used to do that, but I'm hoping that
there is a better or another way.

thanks,
Richard
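
One way to look at every row instead of only the first is a single aggregate pass that keeps the largest count seen so far. The sketch below is an assumption-laden illustration, not something proposed in the thread: the object MaxColumns, the helpers countStates, maxColumns and columnNames, and the sample data are invented names, and aggregate is used here in place of the reduceBy operator mentioned above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaxColumns {
  // Number of non-empty comma-separated values in one STATE cell.
  def countStates(raw: String): Int = raw.split(",").count(_.trim.nonEmpty)

  // Widest row in the RDD; aggregate yields the zero value (0) for an empty RDD.
  def maxColumns(rows: RDD[(Int, String)]): Int =
    rows.aggregate(0)(
      (maxSoFar, row) => math.max(maxSoFar, countStates(row._2)), // within a partition
      (a, b) => math.max(a, b))                                   // across partitions

  // Column names STATE_1 .. STATE_n for a table that is n states wide.
  def columnNames(n: Int): Seq[String] = (1 to n).map(i => s"STATE_$i")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("max-columns").setMaster("local[*]"))
    try {
      val input = sc.parallelize(Seq((1, "TX, NY, FL"), (2, "CA, OH")))
      println(columnNames(maxColumns(input)).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Only one Int per partition travels back to the driver, so the list of new column names can be kept there next to the RDD rather than being repeated in every row.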


Re: Split columns in RDD

Posted by Sabarish Sasidharan <sa...@manthan.com>.
The most efficient way to determine the number of columns would be to do a
take(1) and split in the driver.

Regards
Sab
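
For reference, the take(1) suggestion could look roughly like the sketch below. Only take(1) and the split come from the message above; the object name FirstRowColumns, the helper countFromFirstRow, and the sample data are made up for illustration. It inspects a single row, so the result is only reliable when every row has the same number of values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstRowColumns {
  // Hypothetical helper: count the comma-separated values in the first row
  // returned by take(1); an empty result (empty RDD) yields 0.
  def countFromFirstRow(firstRows: Array[(Int, String)]): Int =
    firstRows.headOption
      .map { case (_, states) => states.split(",").length }
      .getOrElse(0)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("first-row-columns").setMaster("local[*]"))
    try {
      val input = sc.parallelize(Seq((1, "TX, NY, FL"), (2, "CA, OH")))
      // take(1) brings at most one row to the driver; the split happens locally.
      val numColumns = countFromFirstRow(input.take(1))
      println(s"columns according to the first row: $numColumns")
    } finally {
      sc.stop()
    }
  }
}
```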