Posted to user@spark.apache.org by anbutech <an...@outlook.com> on 2019/11/30 03:04:53 UTC
Flatten log data Using Pyspark
Hi,
I have a raw source DataFrame with two columns, as below:

timestamp: 2019-11-29 9:30:45

message_log: <123>NOV 29 10:20:35 ips01 sfids: connection:
tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1
How do we break each key/value in the message above into separate
columns using a UDF in PySpark?

What is the right approach for flattening this type of log data: a
regex or Python logic?

Could you please help me with the logic for flattening the log data?

The final output DataFrame should have the columns below:
timestamp: 2019-11-29 9:30:45
prio: 123
msg_ts: NOV 29 10:20:35
msg_ids: ips01
sfids:
connection: tcp
bytes: 104
user: unknown
url: unknown
host: 127.0.0.1
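For reference, the message above can be flattened with a single prefix pattern plus a comma split. This is a minimal sketch in plain Python, assuming the message always follows the `<prio>TS host tag: key:value,...` shape shown above; `flatten_log` and `PREFIX` are illustrative names, not part of any library:

```python
import re

# Syslog-style prefix: "<prio>MSG_TS MSG_IDS TAG: "
PREFIX = re.compile(
    r"^<(?P<prio>\d+)>"            # priority, e.g. <123>
    r"(?P<msg_ts>\w{3} \d{1,2} [\d:]+) "  # e.g. NOV 29 10:20:35
    r"(?P<msg_ids>\S+) "           # host/sensor, e.g. ips01
    r"(?P<tag>\w+): "              # program tag, e.g. sfids
)

def flatten_log(message_log):
    """Split one message_log string into a flat dict of column values."""
    m = PREFIX.match(message_log)
    if m is None:
        return {}
    fields = m.groupdict()
    # The tag becomes its own (empty) column, matching the expected output.
    fields[fields.pop("tag")] = ""
    # Remainder: "connection: tcp,bytes:104,user:unknown,..."
    for pair in message_log[m.end():].split(","):
        key, _, value = pair.partition(":")
        fields[key.strip()] = value.strip()
    return fields
```

In PySpark, `flatten_log` could be wrapped as a UDF returning `MapType(StringType(), StringType())` (from `pyspark.sql.types`), and the individual keys then selected as columns with `F.col("fields")["prio"]` and so on; a map return type keeps the UDF schema simple even if new keys appear in the log.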
Thanks
Anbu
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Flatten log data Using Pyspark
Posted by Gourav Sengupta <go...@gmail.com>.
Why do you want to use a UDF?
Regards,
Gourav
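The question above points at a common alternative: Spark's built-in `regexp_extract` can pull each field out without a Python UDF, which keeps the work in the JVM and avoids per-row Python serialization. A hedged sketch of one pattern per output column; the patterns are verified here with Python's `re`, and use only constructs (`\d`, `\w`, `[^,]`, capture groups) that behave the same in the Java regex engine Spark uses:

```python
import re

# Sample message from the original post.
MSG = ("<123>NOV 29 10:20:35 ips01 sfids: connection: "
       "tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1")

# One pattern per output column; capture group 1 holds the value.
PATTERNS = {
    "prio":    r"^<(\d+)>",
    "msg_ts":  r"^<\d+>(\w{3} \d{1,2} [\d:]+)",
    "msg_ids": r"^<\d+>\w{3} \d{1,2} [\d:]+ (\S+)",
    "bytes":   r"bytes:(\d+)",
    "user":    r"user:([^,]+)",
    "url":     r"url:([^,]+)",
    "host":    r"host:([^,]+)",
}

def extract(message, pattern):
    """Mimic regexp_extract: group 1 of the first match, else ''."""
    m = re.search(pattern, message)
    return m.group(1) if m else ""
```

In Spark each column would then be added in a loop, e.g. `df.withColumn(name, F.regexp_extract("message_log", pattern, 1))` for each entry in `PATTERNS`, with `F` being `pyspark.sql.functions` and `df` the source DataFrame.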
On Sat, Nov 30, 2019 at 3:06 AM anbutech <an...@outlook.com> wrote:
> Hi,
>
> I have a raw source DataFrame with two columns, as below:
>
> timestamp: 2019-11-29 9:30:45
>
> message_log: <123>NOV 29 10:20:35 ips01 sfids: connection:
> tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1
>
> How do we break each key/value in the message above into separate
> columns using a UDF in PySpark?
>
> What is the right approach for flattening this type of log data: a
> regex or Python logic?
>
> Could you please help me with the logic for flattening the log data?
>
> The final output DataFrame should have the columns below:
>
> timestamp: 2019-11-29 9:30:45
> prio: 123
> msg_ts: NOV 29 10:20:35
> msg_ids: ips01
> sfids:
> connection: tcp
> bytes: 104
> user: unknown
> url: unknown
> host: 127.0.0.1
>
> Thanks
> Anbu