Posted to user@hive.apache.org by Arun Patel <ar...@gmail.com> on 2016/07/01 18:19:38 UTC
Re: RegexSerDe with Filters
Dudu,
Thanks for your continued support. I need one more quick bit of help: I have
one more log file, as shown below.
STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO]
TID:101114719017567668 cluster1 ProcessingHandler Message timestamp:
1467216639090
STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID:
101114719017567668 cluster1: Processed request
STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID:
101114719017567668 Request received in writer
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID:
101114719017567668 Processed request
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID:
101114719017567668 Total time: 10 ms
I need to create 3 views for 3 requirements.
1) Create a view to get the timestamp, TID number and cluster1 for lines
containing "ProcessingHandler Message timestamp". Note that for these lines
there is no space between "TID:" and the TID number.
2) Create a view to get the timestamp and TID for the lines containing
"Request received in writer". There is a space between "TID:" and the TID
number.
3) Create a view to get the timestamp and TID for the lines containing
"Total time:". There is a space between "TID:" and the TID number.
How do I create the base table and views? I am planning to join these 3
views based on TID. Are there any special considerations?
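For what it's worth, the spacing difference may not force separate patterns: "TID:\s*(\d+)" tolerates both the missing and the present space. A quick Python sketch against the sample lines above (the pattern is a suggestion, not something tested against the full log):

```python
import re

# \s* tolerates both "TID:101..." (no space) and "TID: 101..." (one space)
tid_pattern = re.compile(r"TID:\s*(\d+)")

lines = [
    "STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] "
    "TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090",
    "STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] "
    "TID: 101114719017567668 Request received in writer",
]

for line in lines:
    m = tid_pattern.search(line)
    print(m.group(1))  # the TID digits, in both spacing variants
```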
Regards,
Venkat
On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <ar...@gmail.com> wrote:
> Dudu, Thanks for the clarification. Looks like I have an issue with my
> Hive installation. I tried in a different cluster and it works.
>
> Thanks again.
>
>
> On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dm...@paypal.com>
> wrote:
>
>> This is tested, working code.
>>
>> If you’re using https://regex101.com, first replace each backslash pair
>> (\\) with a single backslash (\) and also use the ‘g’ modifier in order
>> to find all of the matches.
>>
>>
>>
>> The regular expression is -
>>
>> (\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)
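As a sanity check, the pattern can be exercised from Python. The log line below is invented to match the stated format, since the real data was not posted:

```python
import re

# The pattern from the thread, single-backslash form as it would appear on
# regex101 (Hive's SerDe needs the backslashes doubled inside the SQL string).
pattern = re.compile(
    r"(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})"
    r"\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)"
)

# A made-up line in the format the pattern expects (the real log was withheld)
line = "host1 2016-06-21 10:00:00,123 com.example.App [INFO] Handler : TID: 42 Reaping path: /data/x"

m = pattern.match(line)
print(m.groups())
```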
>>
>>
>>
>> I’ll send you a screen shot in private, since you don’t want to expose
>> the data.
>>
>>
>>
>> Dudu
>>
>>
>>
>>
>>
>> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com]
>> *Sent:* Friday, June 24, 2016 9:33 PM
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: RegexSerDe with Filters
>>
>>
>>
>> Looks like the regex pattern is not working. I tested the pattern on
>> https://regex101.com/ and it does not find any match.
>>
>>
>>
>> Any suggestions?
>>
>>
>>
>> On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dm...@paypal.com>
>> wrote:
>>
>> My pleasure.
>>
>> Please feel free to reach me if needed.
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com]
>> *Sent:* Wednesday, June 22, 2016 2:57 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: RegexSerDe with Filters
>>
>>
>>
>> Thank you very much, Dudu. This really helps.
>>
>>
>>
>> On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dm...@paypal.com>
>> wrote:
>>
>> Hi
>>
>>
>>
>> Here is the code (without the log data).
>>
>>
>>
>> I’ve created some of the views using different text processing techniques.
>>
>> The rest of the views could be created in similar ways.
>>
>>
>>
>>
>>
>> Dudu
>>
>>
>>
>>
>>
>>
>>
>> bash
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> hdfs dfs -mkdir -p /tmp/log/20160621
>>
>> hdfs dfs -put logfile.txt /tmp/log/20160621
>>
>>
>>
>>
>>
>> hive
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> /*
>>
>> External table log
>>
>>
>>
>> Defines all common columns + optional column 'tid' which appears in
>> most log records + the rest of the log ('txt')
>>
>>
>>
>> */
>>
>>
>>
>> drop table if exists log;
>>
>>
>>
>> create external table log
>>
>> (
>>
>> c1 string
>>
>> ,ts string
>>
>> ,ts_frac string
>>
>> ,log_rec_level string
>>
>> ,c4 string
>>
>> ,c5 string
>>
>> ,tid string
>>
>> ,txt string
>>
>> )
>>
>> partitioned by (dt date)
>>
>> row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>>
>> with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
>>
>> stored as textfile
>>
>> location '/tmp/log'
>>
>> ;
>>
>>
>>
>> alter table log add partition (dt=date '2016-06-21') location
>> '/tmp/log/20160621';
>>
>>
>>
>> select * from log;
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> /*
>>
>> View log_v
>>
>>
>>
>> Base view for all other views
>>
>>
>>
>> */
>>
>>
>>
>> drop view if exists log_v;
>>
>>
>>
>> create view log_v
>>
>> as
>>
>> select c1
>>
>> ,cast (concat_ws ('.',ts,ts_frac) as timestamp) as ts
>>
>> ,log_rec_level
>>
>> ,c4
>>
>> ,c5
>>
>> ,cast (ltrim(substr (tid,5)) as bigint) as tid
>>
>> ,txt
>>
>>
>>
>> from log
>>
>> ;
>>
>>
>>
>> select * from log_v;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_reaping_path;
>>
>>
>>
>> create view log_v_reaping_path
>>
>> as
>>
>> select c1
>>
>> ,ts
>>
>> ,log_rec_level
>>
>> ,c4
>>
>> ,c5
>>
>> ,substr (txt,15) as reaping_path
>>
>>
>>
>> from log_V
>>
>>
>>
>> where txt like 'Reaping path: %'
>>
>> ;
>>
>>
>>
>> select * from log_v_reaping_path;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_published_to_kafka;
>>
>>
>>
>> create view log_v_published_to_kafka
>>
>> as
>>
>> select c1
>>
>> ,ts
>>
>> ,log_rec_level
>>
>> ,c4
>>
>> ,c5
>>
>> ,tid
>>
>>
>>
>> , ltrim (kv [' Key'] ) as key
>>
>> ,cast (ltrim (kv [' size'] ) as bigint ) as size
>>
>> , ltrim (kv [' topic'] ) as topic
>>
>> ,cast (ltrim (kv [' partition']) as int ) as partition
>>
>> ,cast (ltrim (kv [' offset'] ) as bigint ) as offset
>>
>>
>>
>> from (select c1
>>
>> ,ts
>>
>> ,log_rec_level
>>
>> ,c4
>>
>> ,c5
>>
>> ,tid
>>
>> ,str_to_map (substr (txt ,locate
>> ('.',txt)+1),',',':') as kv
>>
>>
>>
>> from log_V
>>
>>
>>
>> where txt like 'Published to Kafka. %'
>>
>> )
>>
>> as t
>>
>> ;
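The str_to_map call in the view above can be mirrored in Python to show why the map keys keep a leading space (and hence the kv[' Key'] lookups followed by ltrim). The sample text is made up, since the real log was withheld:

```python
# Mirrors Hive's str_to_map(substr(txt, locate('.', txt) + 1), ',', ':'):
# everything after the first '.' is split on ',' into pairs, then each pair
# on the first ':' -- so every key keeps its leading space.
txt = "Published to Kafka. Key: k1, size: 10, topic: t1, partition: 0, offset: 55"

rest = txt[txt.index(".") + 1:]                       # substr + locate in Hive
kv = dict(pair.split(":", 1) for pair in rest.split(","))

print(kv[" Key"].strip())        # ltrim(kv[' Key'])
print(int(kv[" size"].strip()))  # cast(ltrim(kv[' size']) as bigint)
```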
>>
>>
>>
>> select * from log_v_published_to_kafka;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_get_request;
>>
>>
>>
>> create view log_v_get_request
>>
>> as
>>
>> select c1
>>
>> ,ts
>>
>> ,log_rec_level
>>
>> ,c4
>>
>> ,c5
>>
>> ,tid
>>
>> ,substr (txt,31) as path
>>
>>
>>
>> from log_V
>>
>>
>>
>> where txt like 'GET request received for path %'
>>
>> ;
>>
>>
>>
>> select * from log_v_get_request;
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>>
>>
>> drop view if exists log_v_unlock_request;
>>
>>
>>
>> create view log_v_unlock_request
>>
>> as
>>
>> select c1
>>
>> ,ts
>>
>> ,log_rec_level
>>
>> ,c4
>>
>> ,c5
>>
>> ,tid
>>
>> ,regexp_extract (txt,'rowkey (\\S+)',1) as rowkey
>>
>> ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id
>>
>>
>>
>> from log_V
>>
>>
>>
>> where txt like 'Unlock request for schema DU %'
>>
>> ;
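The two regexp_extract calls correspond to group 1 of a regex search; a small Python sketch, with an invented sample line in the shape the view filters for:

```python
import re

# Hive's regexp_extract(txt, 'rowkey (\\S+)', 1) is re.search(...).group(1)
# in Python. The sample text below is hypothetical (the real log was withheld).
txt = "Unlock request for schema DU rowkey rk-0017 with lock id lk-42"

rowkey = re.search(r"rowkey (\S+)", txt).group(1)
lock_id = re.search(r"lock id (\S+)", txt).group(1)
print(rowkey, lock_id)
```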
>>
>>
>>
>>
>>
>> *From:* Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
>> *Sent:* Tuesday, June 21, 2016 2:26 PM
>> *To:* user@hive.apache.org
>> *Subject:* RE: RegexSerDe with Filters
>>
>>
>>
>> Hi
>>
>>
>>
>> I would suggest creating a single external table with daily partitions
>> and multiple views each with the appropriate filtering.
>>
>> If you’ll send me a log sample (~100 rows), I’ll send you an example.
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com
>> <ar...@gmail.com>]
>> *Sent:* Tuesday, June 21, 2016 1:51 AM
>> *To:* user@hive.apache.org
>> *Subject:* RegexSerDe with Filters
>>
>>
>>
>> Hello Hive Experts,
>>
>>
>>
>> I use Flume to ingest application-specific logs from Syslog to HDFS.
>> Currently, I grep the HDFS directory for specific patterns (for multiple
>> types of requests) and then create reports. However, generating weekly
>> and monthly reports this way is not scalable.
>>
>>
>>
>> I would like to create multiple external tables on the daily HDFS
>> directory, partitioned by date, with RegexSerDe, and then create separate
>> Parquet tables for every kind of request.
>>
>>
>>
>> The question is: how do I create multiple (about 20) RegexSerDe tables on
>> the same data, applying filters? This would be just like the 20 grep
>> commands I am running today.
>>
>>
>>
>> Example: hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE
>> Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
>>
>> hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN
>> Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
>>
>> hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL
>> TIME' | awk '{print $4, $24}'
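For illustration, here is roughly what one of those grep | awk pipelines does, sketched in Python with a made-up line (awk's 1-based $n fields become 0-based list indexes, so $4 is f[3]):

```python
# A hypothetical log line with enough whitespace-separated fields; the real
# field contents are unknown, so placeholder tokens f10..f20 stand in.
line = ("app1 06-20 10:00 host7 STORE Request Received for APPXXXX "
        "f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20")

if "STORE Request Received for APPXXXX" in line:   # the grep
    f = line.split()                                # awk's default field split
    print(f[3], f[12], f[13], f[16], f[19])         # awk '{print $4, $13, $14, $17, $20}'
```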
>>
>>
>>
>> I would like to create tables which do this kind of job and then write
>> the output to Parquet tables.
>>
>>
>>
>> Please let me know how this can be done. Thank you!
>>
>>
>>
>> Regards,
>>
>> Arun
>>
>>
>>
>>
>>
>
>
Re: RegexSerDe with Filters
Posted by Arun Patel <ar...@gmail.com>.
Thanks Dudu. A couple of additional questions:
1) I need to have the partition column in my view, and I am able to add it.
But will it add any performance overhead?
2) I have six 2GB files under the date partition, and currently I have only
one partition. I always run into 'OutOfMemoryError: Java heap space' when I
run the above query, which uses group by tid. I considered
increasing hive.tez.container.size and hive.tez.java.opts. Is this due to
the 'group by'? How do I identify the required memory setting? Or any other
suggestions?
On Sun, Jul 3, 2016 at 12:14 AM, Markovitz, Dudu <dm...@paypal.com>
wrote:
>
>
> select ...
>
> ,sort_array (collect_list(case when att like
> '%ProcessingHandler Message%' then val_num end)) as timestamps_with_dup
>
> ,sort_array (collect_set (case when att like
> '%ProcessingHandler Message%' then val_num end)) as timestamps_no_dup
>
>
>
> ...
>
> from v
>
> group by tid
>
> ;
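The difference between the two columns above can be sketched in Python: collect_list keeps duplicates, collect_set drops them, sort_array sorts both, and rows where the CASE yields no value (NULL) are skipped by both aggregates. The rows below are hand-built stand-ins:

```python
# Rows for one tid: (tid, message-timestamp or None for non-matching rows)
rows = [
    (101, 1467216639090),
    (101, 1467216639090),   # duplicate timestamp for the same TID
    (101, 1467216639250),
    (101, None),            # a row where the CASE branch yields NULL
]

vals = [v for _, v in rows if v is not None]   # aggregates skip NULLs
with_dup = sorted(vals)                        # sort_array(collect_list(...))
no_dup = sorted(set(vals))                     # sort_array(collect_set(...))
print(with_dup, no_dup)
```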
>
>
>
> *From:* Arun Patel [mailto:arunp.bigdata@gmail.com]
> *Sent:* Sunday, July 03, 2016 12:39 AM
>
> *To:* user@hive.apache.org
> *Subject:* Re: RegexSerDe with Filters
>
>
>
> Dudu,
>
>
>
> That's an excellent suggestion. Yes, it works for me.
>
>
>
> But if there are two lines with "ProcessingHandler Message" and they
> have the same TID, I want to create an array of two timestamps.
>
>
>
> Is it possible without join?
>
>
>
>
>
>
>
>
>
> On Sat, Jul 2, 2016 at 3:56 PM, Markovitz, Dudu <dm...@paypal.com>
> wrote:
>
> Hi Venkat
>
>
>
> You don’t necessarily need the three views if your goal is to join them.
>
> You can achieve the same result using a single view and an aggregated
> query.
>
> Please test the following code and see if it works for you, or whether
> you would like a different solution.
>
>
>
> Dudu
>
>
>
>
>
> create external table t
>
> (
>
> c1 string
>
> ,ts string
>
> ,c3 string
>
> ,log_rec_level string
>
> ,tid string
>
> ,att string
>
> ,val string
>
> ,val_num string
>
> )
>
> row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>
> with serdeproperties ('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)')
>
> stored as textfile
>
> location '/tmp/t'
>
> ;
>
>
>
> create view v
>
> as
>
> select c1
>
> ,cast (ts as timestamp) as ts
>
> ,c3 as c3
>
> ,log_rec_level as log_rec_level
>
> ,cast (tid as bigint) as tid
>
> ,att as att
>
> ,val as val
>
> ,cast (val_num as bigint) as val_num
>
>
>
> from t
>
> ;
>
>
>
>
>
> select tid
>
>
>
> ,min (case when att like '%ProcessingHandler Message%' then
> ts end) as ts_ProcessingHandler_Message
>
> ,min (case when att = 'Request received in writer' then
> ts end) as ts_Request_received_in_writer
>
> ,min (case when att = 'Total time' then
> ts end) as ts_Total_time
>
>
>
> ,min (case when att like '%ProcessingHandler Message%' then
> val_num end) as timestamp
>
> ,min (case when att = 'Total time' then
> val_num end) as Total_time
>
>
>
> from v
>
>
>
> group by tid
>
> ;
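The min(CASE WHEN ...) pivot in the query above can be mimicked in Python; the rows are a hand-built stand-in for view v over the sample log lines:

```python
# Each row mimics view v: (tid, att, ts). For each tid, the earliest ts of
# each event type lands in its own output column, exactly like min(CASE ...).
rows = [
    (101, "cluster1 ProcessingHandler Message timestamp", "12:10:39.142"),
    (101, "Request received in writer",                   "12:10:39.163"),
    (101, "Total time",                                   "12:10:39.273"),
]

def min_when(pred):
    """min() over rows whose att matches; None when no row matches (SQL NULL)."""
    vals = [ts for _, att, ts in rows if pred(att)]
    return min(vals) if vals else None

ts_processing = min_when(lambda a: "ProcessingHandler Message" in a)
ts_writer     = min_when(lambda a: a == "Request received in writer")
ts_total      = min_when(lambda a: a == "Total time")
print(ts_processing, ts_writer, ts_total)
```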
RE: RegexSerDe with Filters
Posted by "Markovitz, Dudu" <dm...@paypal.com>.
select ...
,sort_array (collect_list(case when att like '%ProcessingHandler Message%' then val_num end)) as timestamps_with_dup
,sort_array (collect_set (case when att like '%ProcessingHandler Message%' then val_num end)) as timestamps_no_dup
...
from v
group by tid
;
Re: RegexSerDe with Filters
Posted by Arun Patel <ar...@gmail.com>.
Dudu,
That's an excellent suggestion. Yes, it works for me.
But, if there are two lines with "ProcessingHandler Message" and they have
same TID, I want to create an array of two timestamps.
Is it possible without join?
On Sat, Jul 2, 2016 at 3:56 PM, Markovitz, Dudu <dm...@paypal.com>
wrote:
> Hi Venkat
>
>
>
> You don’t necessarily need the three views if your goal is to join them.
>
> You can achieve the same result using a single view and an aggregated
> query.
>
> Please test the following code and see if it works for you or you would
> like to get a different solution.
>
>
>
> Dudu
>
>
>
>
>
> create external table t
>
> (
>
> c1 string
>
> ,ts string
>
> ,c3 string
>
> ,log_rec_level string
>
> ,tid string
>
> ,att string
>
> ,val string
>
> ,val_num string
>
> )
>
> row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>
> with serdeproperties
> ('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)')
>
> stored as textfile
>
> location '/tmp/t'
>
> ;
>
>
>
> create view v
>
> as
>
> select c1
>
> ,cast (ts as timestamp) as ts
>
> ,c3 as c3
>
> ,log_rec_level as log_rec_level
>
> ,cast (tid as bigint) as tid
>
> ,att as att
>
> ,val as val
>
> ,cast (val_num as bigint) as val_num
>
>
>
> from t
>
> ;
>
>
>
>
>
> select tid
>
>
>
> ,min (case when att like '%ProcessingHandler Message%' then
> ts end) as ts_ProcessingHandler_Message
>
> ,min (case when att = 'Request received in writer' then
> ts end) as ts_Request_received_in_writer
>
> ,min (case when att = 'Total time' then
> ts end) as ts_Total_time
>
>
>
> ,min (case when att like '%ProcessingHandler Message%' then
> val_num end) as timestamp
>
> ,min (case when att = 'Total time' then
> val_num end) as Total_time
>
>
>
> from v
>
>
>
> group by tid
>
> ;
>
>
>
>
>
>
>
>
RE: RegexSerDe with Filters
Posted by "Markovitz, Dudu" <dm...@paypal.com>.
Hi Venkat
You don’t necessarily need the three views if your goal is to join them.
You can achieve the same result using a single view and an aggregated query.
Please test the following code and see if it works for you, or whether you would like a different solution.
Dudu
create external table t
(
c1 string
,ts string
,c3 string
,log_rec_level string
,tid string
,att string
,val string
,val_num string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+(.{23})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+TID:\\s*(\\d+)\\s+([^:]+):?\\s*((\\d+)?.*)')
stored as textfile
location '/tmp/t'
;
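As a quick sanity check outside Hive, the input.regex above can be exercised in Python against the sample log lines quoted in the thread (per the regex101 note earlier in the thread, the doubled backslashes in the SerDe property become single backslashes in a raw Python regex):

```python
import re

# input.regex from the create table above, with Hive's doubled
# backslashes (\\S, \\d, ...) reduced to single ones for Python.
pattern = re.compile(
    r'(\S+)\s+(.{23})\s+(\S+)\s+\[([^]]+)\]\s+TID:\s*(\d+)\s+([^:]+):?\s*((\d+)?.*)'
)

# Sample lines quoted in the thread.
samples = [
    'STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] '
    'TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090',
    'STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] '
    'TID: 101114719017567668 Request received in writer',
    'STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] '
    'TID: 101114719017567668 Total time: 10 ms',
]

for s in samples:
    m = pattern.fullmatch(s)
    # group order mirrors the table columns:
    # c1, ts, c3, log_rec_level, tid, att, val, val_num
    print(m.group(5), '|', m.group(6), '|', m.group(7))
```

Note that the `\s*` after `TID:` is what absorbs the optional space between "TID:" and the TID number, so both the "TID:101..." and "TID: 101..." variants parse into the same tid column.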
create view v
as
select c1
,cast (ts as timestamp) as ts
,c3 as c3
,log_rec_level as log_rec_level
,cast (tid as bigint) as tid
,att as att
,val as val
,cast (val_num as bigint) as val_num
from t
;
select tid
,min (case when att like '%ProcessingHandler Message%' then ts end) as ts_ProcessingHandler_Message
,min (case when att = 'Request received in writer' then ts end) as ts_Request_received_in_writer
,min (case when att = 'Total time' then ts end) as ts_Total_time
,min (case when att like '%ProcessingHandler Message%' then val_num end) as timestamp
,min (case when att = 'Total time' then val_num end) as Total_time
from v
group by tid
;
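For intuition, here is a minimal Python sketch of what the conditional-aggregation query above computes: one pass over the rows, grouped by tid, keeping the first timestamp seen per attribute as a stand-in for min() (assuming rows are scanned in timestamp order). The rows are abbreviated from the sample log in the thread.

```python
from collections import defaultdict

# Rows abbreviated from the sample log: (tid, att, ts).
rows = [
    (101114719017567668, 'cluster1 ProcessingHandler Message timestamp', '12:10:39.142'),
    (101114719017567668, 'Request received in writer', '12:10:39.163'),
    (101114719017567668, 'Total time', '12:10:39.273'),
]

pivot = defaultdict(dict)
for tid, att, ts in rows:
    # setdefault keeps the first value seen per key -- a stand-in for
    # min(), assuming the rows arrive in timestamp order.
    if 'ProcessingHandler Message' in att:
        pivot[tid].setdefault('ts_processing', ts)
    elif att == 'Request received in writer':
        pivot[tid].setdefault('ts_writer', ts)
    elif att == 'Total time':
        pivot[tid].setdefault('ts_total', ts)
```

Each tid ends up as one row with one column per event type, which is why no join between per-attribute views is needed.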
From: Arun Patel [mailto:arunp.bigdata@gmail.com]
Sent: Friday, July 01, 2016 9:20 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters
Dudu,
Thanks for your continued support. I need one more quick bit of help. I have one more log file, as shown below.
STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090
STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID: 101114719017567668 cluster1: Processed request
STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Request received in writer
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Processed request
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Total time: 10 ms
I need to create 3 views for 3 requirements.
1) Create a view to get the timestamp, TID number, and cluster1 for lines containing "ProcessingHandler Message timestamp". For these lines there is no space between "TID:" and the TID number.
2) Create a view to get the timestamp and TID for lines containing "Request received in writer". There is a space between "TID:" and the TID number.
3) Create a view to get the timestamp and TID for lines containing "Total time:". There is a space between "TID:" and the TID number.
How do I create the base table and views? I am planning to join these 3 views on TID. Are there any special considerations?
Regards,
Venkat
On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <ar...@gmail.com> wrote:
Dudu, Thanks for the clarification. Looks like I have an issue with my Hive installation. I tried in a different cluster and it works.
Thanks again.
On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dm...@paypal.com> wrote:
This is tested, working code.
If you’re using https://regex101.com, first replace backslash pairs (\\ ) with a single backslash (\) and also use the ‘g’ modifier in order to find all of the matches.
The regular expression is -
(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)
I’ll send you a screen shot in private, since you don’t want to expose the data.
Dudu
From: Arun Patel [mailto:arunp.bigdata@gmail.com]
Sent: Friday, June 24, 2016 9:33 PM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters
Looks like the regex pattern is not working. I tested the pattern on https://regex101.com/ and it does not find any match.
Any suggestions?
On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dm...@paypal.com> wrote:
My pleasure.
Please feel free to reach me if needed.
Dudu
From: Arun Patel [mailto:arunp.bigdata@gmail.com]
Sent: Wednesday, June 22, 2016 2:57 AM
To: user@hive.apache.org
Subject: Re: RegexSerDe with Filters
Thank you very much, Dudu. This really helps.
On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dm...@paypal.com> wrote:
Hi
Here is the code (without the log data).
I’ve created some of the views using different text-processing techniques.
The rest of the views can be created in similar ways.
Dudu
bash
----------------------------------------------------------------------------------------------------
hdfs dfs -mkdir -p /tmp/log/20160621
hdfs dfs -put logfile.txt /tmp/log/20160621
hive
----------------------------------------------------------------------------------------------------
/*
External table log
Defines all common columns + optional column 'tid' which appears in most log records + the rest of the log ('txt')
*/
drop table if exists log;
create external table log
(
c1 string
,ts string
,ts_frac string
,log_rec_level string
,c4 string
,c5 string
,tid string
,txt string
)
partitioned by (dt date)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
stored as textfile
location '/tmp/log'
;
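Since the real log data was withheld from the thread, the following Python sketch exercises the same pattern (single-backslash form) against a made-up line; every field value in it (host, class name, path) is illustrative only:

```python
import re

# input.regex from the create table above, reduced to single backslashes.
pattern = re.compile(
    r'(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})'
    r'\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)'
)

# Made-up line shaped like the June log format; all values are illustrative.
line = ('node1 2016-06-21 10:00:00,123 com.example.Worker [INFO] main : '
        'TID: 42 Reaping path: /tmp/foo')

m = pattern.fullmatch(line)
tid_raw, txt = m.group(7), m.group(8)

# The log_v view does cast(ltrim(substr(tid, 5)) as bigint):
tid = int(tid_raw[4:].strip())          # 'TID: 42' -> 42
# log_v_reaping_path does substr(txt, 15) to drop the 'Reaping path: ' prefix:
reaping_path = txt[14:]                 # Hive's 1-based substr(.., 15) == [14:]
```

The optional `(TID:\s\d+)?` group is what lets records without a TID still load, with a NULL tid column.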
alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';
select * from log;
----------------------------------------------------------------------------------------------------
/*
View log_v
Base view for all other views
*/
drop view if exists log_v;
create view log_v
as
select c1
,cast (concat_ws ('.',ts,ts_frac) as timestamp) as ts
,log_rec_level
,c4
,c5
,cast (ltrim(substr (tid,5)) as bigint) as tid
,txt
from log
;
select * from log_v;
----------------------------------------------------------------------------------------------------
drop view if exists log_v_reaping_path;
create view log_v_reaping_path
as
select c1
,ts
,log_rec_level
,c4
,c5
,substr (txt,15) as reaping_path
from log_V
where txt like 'Reaping path: %'
;
select * from log_v_reaping_path;
----------------------------------------------------------------------------------------------------
drop view if exists log_v_published_to_kafka;
create view log_v_published_to_kafka
as
select c1
,ts
,log_rec_level
,c4
,c5
,tid
, ltrim (kv [' Key'] ) as key
,cast (ltrim (kv [' size'] ) as bigint ) as size
, ltrim (kv [' topic'] ) as topic
,cast (ltrim (kv [' partition']) as int ) as partition
,cast (ltrim (kv [' offset'] ) as bigint ) as offset
from (select c1
,ts
,log_rec_level
,c4
,c5
,tid
,str_to_map (substr (txt ,locate ('.',txt)+1),',',':') as kv
from log_V
where txt like 'Published to Kafka. %'
)
as t
;
select * from log_v_published_to_kafka;
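Here is a rough Python equivalent of the str_to_map() call in the view above, using a made-up "Published to Kafka." line (the real data was withheld, so the key names and values are illustrative). It shows why the view indexes kv[' Key'], kv[' size'], and so on with leading spaces:

```python
# Minimal stand-in for Hive's str_to_map(s, ',', ':'): split into entries
# on ',', then split each entry on the first ':'.
def str_to_map(s, entry_sep=',', kv_sep=':'):
    return dict(e.split(kv_sep, 1) for e in s.split(entry_sep))

# Made-up line shaped like the view expects; values are illustrative only.
txt = 'Published to Kafka. Key: k1, size: 1024, topic: events, partition: 0, offset: 42'

# substr(txt, locate('.', txt) + 1) in HiveQL == everything after the first '.'
kv = str_to_map(txt[txt.index('.') + 1:])

# Keys keep their leading space, hence kv[' Key'] rather than kv['Key'];
# ltrim()/cast() in the view correspond to strip()/int() here.
key = kv[' Key'].strip()
size = int(kv[' size'])
offset = int(kv[' offset'])
```

A design note: splitting on the first '.' keeps the parsing robust even if the message prefix changes length, since nothing before the dot is needed once the `like 'Published to Kafka. %'` filter has matched.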
----------------------------------------------------------------------------------------------------
drop view if exists log_v_get_request;
create view log_v_get_request
as
select c1
,ts
,log_rec_level
,c4
,c5
,tid
,substr (txt,31) as path
from log_V
where txt like 'GET request received for path %'
;
select * from log_v_get_request;
----------------------------------------------------------------------------------------------------
drop view if exists log_v_unlock_request;
create view log_v_unlock_request
as
select c1
,ts
,log_rec_level
,c4
,c5
,tid
,regexp_extract (txt,'rowkey (\\S+)',1) as rowkey
,regexp_extract (txt,'lock id (\\S+)',1) as lock_id
from log_V
where txt like 'Unlock request for schema DU %'
;