You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2017/10/30 17:52:34 UTC

Split column with dynamic data

Hi all,

I've a requirement to split a column and fetch only the description where I
have numbers appended before that for some rows whereas other rows have
only the description -

Eg - (Description is the column header)

*Description*
Inventory Tree
Products
1. AT&T Services
2. Accessories
4. Miscellaneous
5. Service Center Items
3. 3rd Party Services
Integrated Service
6. Demo Devices
IT Department
Merchandising
2A. Impulse
2. Handsets
3. Strategic Products
1. Opportunities
5. Features
6. Rate Plans
7. Other AT&T Incentives
8. Wired
4. Rate Plan Tier Change
Integration SKUs (March 2016)
9. Financing SKUs
1. Smartphone

From the above, I only want the words of description and remove the
numbers. But since they're abrupt, it is difficult to use a single logic to
have it done.

Trying with the following options -

1) Using split by *fullstop*, together -

*split_col = split(CategoryInp_DF['description'], '.')*

*CategoryInp_DF= CategoryInp_DF.withColumn('unneeded',
split_col.getItem(0))*
*CategoryInp_DF= CategoryInp_DF.withColumn('description',
split_col.getItem(1))*


Result -> But both the output columns come as nulls.

2) Using split by *fullstop and space*, together -

*split_col = split(CategoryInp_DF['description'], '. ')*

*CategoryInp_DF= CategoryInp_DF.withColumn('unneeded',
split_col.getItem(0))*
*CategoryInp_DF= CategoryInp_DF.withColumn('description',
split_col.getItem(1))*

Result -> This is perfectly working for the numbered rows, but is
unnecessarily splitting the non-numbered rows too.

Any help would be greatly appreciated.


Thanks,
Aakash.

RE: Split column with dynamic data

Posted by Aakash Basu <aa...@gmail.com>.
Hey buddy,


Thanks a TON! Issue resolved.

Thanks again,
Aakash.

On 30-Oct-2017 11:44 PM, "Hondros, Constantine (ELS-AMS)" <
C.Hondros@elsevier.com> wrote:

> You should just use regexp_replace to remove all the leading number
> information (assuming it ends with a full-stop, and catering for the
> possibility of a capital letter).
>
>
>
> This is untested, but it shoud do the trick based on your examples so far:
>
>
>
> df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”,
> “”))
>
>
>
>
>
> *From:* Aakash Basu [mailto:aakash.spark.raj@gmail.com]
> *Sent:* 30 October 2017 18:53
> *To:* user
> *Subject:* Split column with dynamic data
>
>
>
> **** External email: use caution ****
>
>
>
> Hi all,
>
>
>
> I've a requirement to split a column and fetch only the description where
> I have numbers appended before that for some rows whereas other rows have
> only the description -
>
>
>
> Eg - (Description is the column header)
>
>
>
> *Description*
>
> Inventory Tree
>
> Products
>
> 1. AT&T Services
>
> 2. Accessories
>
> 4. Miscellaneous
>
> 5. Service Center Items
>
> 3. 3rd Party Services
>
> Integrated Service
>
> 6. Demo Devices
>
> IT Department
>
> Merchandising
>
> 2A. Impulse
>
> 2. Handsets
>
> 3. Strategic Products
>
> 1. Opportunities
>
> 5. Features
>
> 6. Rate Plans
>
> 7. Other AT&T Incentives
>
> 8. Wired
>
> 4. Rate Plan Tier Change
>
> Integration SKUs (March 2016)
>
> 9. Financing SKUs
>
> 1. Smartphone
>
>
>
>
> From the above, I only want the words of description and remove the
> numbers. But since they're abrupt, it is difficult to use a single logic to
> have it done.
>
>
>
> Trying with the following options -
>
>
>
> 1) Using split by *fullstop*, together -
>
>
>
> *split_col = split(CategoryInp_DF['description'], '.')*
>
>
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('unneeded',
> split_col.getItem(0))*
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('description',
> split_col.getItem(1))*
>
>
>
>
>
> Result -> But both the output columns come as nulls.
>
>
>
> 2) Using split by *fullstop and space*, together -
>
> *split_col = split(CategoryInp_DF['description'], '. ')*
>
>
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('unneeded',
> split_col.getItem(0))*
>
> *CategoryInp_DF= CategoryInp_DF.withColumn('description',
> split_col.getItem(1))*
>
>
>
> Result -> This is perfectly working for the numbered rows, but is
> unnecessarily splitting the non-numbered rows too.
>
>
>
> Any help would be greatly appreciated.
>
>
>
>
>
> Thanks,
>
> Aakash.
>
> ------------------------------
>
> Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The
> Netherlands
> <https://maps.google.com/?q=Radarweg+29,+1043+NX+Amsterdam,+The+Netherlands&entry=gmail&source=g>,
> Registration No. 33156677, Registered in The Netherlands.
>

RE: Split column with dynamic data

Posted by "Hondros, Constantine (ELS-AMS)" <C....@elsevier.com>.
You should just use regexp_replace to remove all the leading number information (assuming it ends with a full-stop, and catering for the possibility of a capital letter).

This is untested, but it shoud do the trick based on your examples so far:

df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”, “”))


From: Aakash Basu [mailto:aakash.spark.raj@gmail.com]
Sent: 30 October 2017 18:53
To: user
Subject: Split column with dynamic data


*** External email: use caution ***


Hi all,

I've a requirement to split a column and fetch only the description where I have numbers appended before that for some rows whereas other rows have only the description -

Eg - (Description is the column header)

Description

Inventory Tree

Products

1. AT&T Services

2. Accessories

4. Miscellaneous

5. Service Center Items

3. 3rd Party Services

Integrated Service

6. Demo Devices

IT Department

Merchandising

2A. Impulse

2. Handsets

3. Strategic Products

1. Opportunities

5. Features

6. Rate Plans

7. Other AT&T Incentives

8. Wired

4. Rate Plan Tier Change

Integration SKUs (March 2016)

9. Financing SKUs

1. Smartphone



From the above, I only want the words of description and remove the numbers. But since they're abrupt, it is difficult to use a single logic to have it done.

Trying with the following options -

1) Using split by fullstop, together -

split_col = split(CategoryInp_DF['description'], '.')

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))
CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))


Result -> But both the output columns come as nulls.

2) Using split by fullstop and space, together -
split_col = split(CategoryInp_DF['description'], '. ')

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))
CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

Result -> This is perfectly working for the numbered rows, but is unnecessarily splitting the non-numbered rows too.

Any help would be greatly appreciated.


Thanks,
Aakash.

________________________________

Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.