Posted to user@spark.apache.org by anbutech <an...@outlook.com> on 2019/11/30 03:04:53 UTC

Flatten log data Using Pyspark

Hi,

I have a raw source data frame with two columns, as below:

timestamp                              
2019-11-29 9:30:45

message_log

<123>NOV 29 10:20:35 ips01 sfids: connection:
tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1

How do we break each of the above key:value pairs into a separate column using a UDF in
PySpark?

What is the right approach for flattening this type of log data: regex or
plain Python logic?

Could you please help me with the logic for flattening the log data?

The final output dataframe should have the following columns:

timestamp:  2019-11-29 9:30:45
prio:       123
msg_ts:     NOV 29 10:20:35
msg_ids:    ips01
sfids:
connection: tcp
bytes:      104
user:       unknown
url:        unknown
host:       127.0.0.1
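
For reference, the sample line above can be parsed with a single regular
expression wrapped in a UDF. A minimal sketch in plain Python; the pattern and
field names are assumptions derived from the one sample line, and the Spark
wrapping is shown only as untested comments:

```python
import re

# Assumed pattern, fitted to the single sample line above; real logs may
# need a looser pattern (e.g. an arbitrary key:value tail after the prefix).
LOG_PATTERN = re.compile(
    r"<(?P<prio>\d+)>"                               # <123>
    r"(?P<msg_ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) "  # NOV 29 10:20:35
    r"(?P<msg_ids>\S+) "                             # ips01
    r"(?P<sfids>\w+): "                              # sfids (program name)
    r"connection:\s*(?P<connection>[^,]+),"
    r"bytes:(?P<bytes>[^,]+),"
    r"user:(?P<user>[^,]+),"
    r"url:(?P<url>[^,]+),"
    r"host:(?P<host>\S+)"
)

def parse_message_log(line):
    """Return a dict of named fields, or None when the line does not match."""
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else None

# To use this from Spark, wrap it as a UDF returning a map, then pull the
# keys out as columns (untested sketch):
# from pyspark.sql.functions import col, udf
# from pyspark.sql.types import MapType, StringType
# parse_udf = udf(parse_message_log, MapType(StringType(), StringType()))
# df = df.withColumn("fields", parse_udf("message_log"))
# for name in ["prio", "msg_ts", "msg_ids", "connection",
#              "bytes", "user", "url", "host"]:
#     df = df.withColumn(name, col("fields")[name])
```

Returning None for non-matching lines lets malformed records surface as
nulls instead of raising inside the executor.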


Thanks
Anbu




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Flatten log data Using Pyspark

Posted by Gourav Sengupta <go...@gmail.com>.
Why do you want to use a UDF?

Regards,
Gourav
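
As the question above suggests, a UDF is not strictly required here: Spark's
built-in `regexp_extract` can pull each field out on the JVM side, avoiding
Python serialization overhead. A minimal sketch, assuming the sample line in
the original post is representative; the per-field patterns are checked below
with Python's `re` (these constructs behave the same under the Java regex
syntax Spark uses), and the Spark calls themselves are untested comments:

```python
import re

# Per-field patterns intended for pyspark.sql.functions.regexp_extract.
# Each pattern captures its field in group 1 and uses only constructs
# shared by Python and Java regex syntax.
FIELD_PATTERNS = {
    "prio":       r"<(\d+)>",
    "msg_ts":     r">([A-Z]{3} \d{1,2} \d{2}:\d{2}:\d{2})",
    "msg_ids":    r"\d{2}:\d{2}:\d{2} (\S+)",
    "connection": r"connection:\s*([^,]+)",
    "bytes":      r"bytes:([^,]+)",
    "user":       r"user:([^,]+)",
    "url":        r"url:([^,]+)",
    "host":       r"host:(\S+)",
}

def extract(line, pattern):
    """Mimic regexp_extract: return group 1, or '' when there is no match."""
    m = re.search(pattern, line)
    return m.group(1) if m else ""

# In Spark this becomes a chain of withColumn calls (untested sketch):
# from pyspark.sql.functions import regexp_extract
# for name, pat in FIELD_PATTERNS.items():
#     df = df.withColumn(name, regexp_extract("message_log", pat, 1))
```

Like regexp_extract itself, the helper returns an empty string when a field
is absent, so partial lines degrade gracefully rather than failing.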
