You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by John Paul Jayme <jo...@tdcx.com.INVALID> on 2023/06/21 10:45:03 UTC

Rename columns without manually setting them all

Hi,

This is currently my column definition :
Employee ID     Name    Client  Project Team    01/01/2022      02/01/2022      03/01/2022      04/01/2022      05/01/2022
12345       Dummy x     Dummy a abc     team a  OFF     WO      WH      WH      WH

As you can see, the outer columns are just daily attendance dates. My goal is to count the employees who were OFF / WO / WH on said dates. I need to transpose them so it would look like this :

[cid:ff6d0260-0168-40a4-82db-6c2acd517c39]

I am still new to pandas. Can you guide me on how to produce this? I am reading about melt() and set_index() but I am not sure if they are the correct functions to use.


Re: Rename columns without manually setting them all

Posted by Bjørn Jørgensen <bj...@gmail.com>.
data = {
    "Employee ID": [12345, 12346, 12347, 12348, 12349],
    "Name": ["Dummy x", "Dummy y", "Dummy z", "Dummy a", "Dummy b"],
    "Client": ["Dummy a", "Dummy b", "Dummy c", "Dummy d", "Dummy e"],
    "Project": ["abc", "def", "ghi", "jkl", "mno"],
    "Team": ["team a", "team b", "team c", "team d", "team e"],
    "01/01/2022": ["OFF", "WO", "WH", "WH", "OFF"],
    "02/01/2022": ["WO", "WO", "OFF", "WH", "WH"],
    "03/01/2022": ["WH", "WH", "WH", "OFF", "WO"],
    "04/01/2022": ["WH", "WO", "WO", "WH", "OFF"],
    "05/01/2022": ["WH", "WH", "OFF", "WO", "WO"],
}

df = ps.DataFrame(data)

# Define dates columns
dates_columns = df.columns[5:]

# Melt the dataframe and count the occurrences
df_melt = df.melt(id_vars=df.columns[:5], value_vars=dates_columns,
var_name="Date", value_name="Status")
df_counts = df_melt.groupby(["Date", "Status"]).size().unstack()
df_counts.sort_index(inplace=True)
df_counts

[image: image.png]

ons. 21. juni 2023 kl. 14:39 skrev Farshid Ashouri <
farsheed.ashouri@gmail.com>:

> You can use selectExpr and stack to achieve the same effect in PySpark:
>
>
>
> df = spark.read.csv("your_file.csv", header=True, inferSchema=True)
>
> date_columns = [col for col in df.columns if '/' in col]
>
> df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`",
> "`Team`”]
> + [f"stack({len(date_columns)}, {', '.join([f'`{col}`, `{col}` as
> `Status`' for col in date_columns])}) as (`Date`, `Status`)”])
>
> result = df.groupby("Date", "Status").count()
>
>
>
>
> On 21 Jun 2023, at 11:45, John Paul Jayme <jo...@tdcx.com.INVALID>
> wrote:
>
> Hi,
>
> This is currently my column definition :
> Employee ID Name Client Project Team 01/01/2022 02/01/2022 03/01/2022
> 04/01/2022 05/01/2022
> 12345     Dummy x Dummy a abc team a OFF WO  WH WH WH
> As you can see, the outer columns are just daily attendance dates. My goal
> is to count the employees who were OFF / WO / WH on said dates. I need to
> transpose them so it would look like this :
>
> <image.png>
>
> I am still new to pandas. Can you guide me on how to produce this? I am
> reading about melt() and set_index() but I am not sure if they are the
> correct functions to use.
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Rename columns without manually setting them all

Posted by Farshid Ashouri <fa...@gmail.com>.
You can use selectExpr and stack to achieve the same effect in PySpark:



df = spark.read.csv("your_file.csv", header=True, inferSchema=True)

date_columns = [col for col in df.columns if '/' in col]

df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`”] 
	+ [f"stack({len(date_columns)}, {', '.join([f'`{col}`, `{col}` as `Status`' for col in date_columns])}) as (`Date`, `Status`)”])

result = df.groupby("Date", "Status").count()




> On 21 Jun 2023, at 11:45, John Paul Jayme <jo...@tdcx.com.INVALID> wrote:
> 
> Hi,
> 
> This is currently my column definition :
> Employee ID	Name	Client	Project	Team	01/01/2022	02/01/2022	03/01/2022	04/01/2022	05/01/2022
> 12345	    Dummy x	Dummy a	abc	team a	OFF	WO 	WH	WH	WH
> 
> As you can see, the outer columns are just daily attendance dates. My goal is to count the employees who were OFF / WO / WH on said dates. I need to transpose them so it would look like this : 
> 
> <image.png>
> 
> I am still new to pandas. Can you guide me on how to produce this? I am reading about melt() and set_index() but I am not sure if they are the correct functions to use.