Posted to dev@hive.apache.org by "Johan Oskarsson (JIRA)" <ji...@apache.org> on 2009/05/27 12:36:45 UTC

[jira] Created: (HIVE-519) Regex processing gets stuck when querying weblogs

Regex processing gets stuck when querying weblogs
-------------------------------------------------

                 Key: HIVE-519
                 URL: https://issues.apache.org/jira/browse/HIVE-519
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.4.0
            Reporter: Johan Oskarsson
            Priority: Critical
             Fix For: 0.4.0


When running a simple query on a table similar to the apachelog table described in the user guide (http://wiki.apache.org/hadoop/Hive/UserGuide), the regular expression matcher gets stuck in an infinite loop.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Martin Dittus (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714102#action_12714102 ] 

Martin Dittus commented on HIVE-519:
------------------------------------

To reiterate -- the behaviour reported is that records with certain character sequences (in this example: an HTTP request line with parentheses in the requested path) take many orders of magnitude longer to process than usual. 

This looks like a consequence of Java's regex implementation being a backtracking (NFA-style) matcher, which performs badly in worst-case scenarios (probably including this one). See e.g. http://weblogs.java.net/blog/tomwhite/archive/2006/03/a_faster_java_r.html for some background.

There are essentially two options to avoid this: Alter the expression, or use a different regex library. There may be a way to do the former. 

I'll use TestTCTLSeparatedProtocol.test1ApacheLogFormat() as an example.

This is the pattern generated for this test: (?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)
Note the sub-expression: ...[^("|\[|\])]+...

I.e., it builds a pattern for a negated character class ("[^...]") that unfortunately doesn't contain a list of characters, but instead another pattern group in parentheses, namely the value of QUOTE_CHAR: ("|\[|\])

After I manually converted this sub-expression to the legitimate "[^"\[\]]+" character class, the pattern matcher performed admirably against Johan's test case.
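
To make the difference between the two character classes concrete, here is a minimal sketch using java.util.regex. The class name CharClassCheck and its helper method are made up for illustration; only the two character classes come from this thread. It simply asks each class whether it accepts a handful of single characters:

```java
import java.util.regex.Pattern;

// Illustrative only: CharClassCheck is not part of Hive.
class CharClassCheck {
    // Character class as currently generated: [^("|\[|\])]
    static final Pattern GENERATED = Pattern.compile("[^(\"|\\[|\\])]");
    // Hand-corrected character class: [^"\[\]]
    static final Pattern CORRECTED = Pattern.compile("[^\"\\[\\]]");

    // True if the single character c is accepted by the given class.
    static boolean accepts(Pattern charClass, char c) {
        return charClass.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // Probe parentheses, pipe, the three quote characters and a letter.
        for (char c : "()|\"[]a".toCharArray()) {
            System.out.println("'" + c + "' generated=" + accepts(GENERATED, c)
                    + " corrected=" + accepts(CORRECTED, c));
        }
    }
}
```

Both classes reject the three quote characters, but only the generated one also rejects "(", ")" and "|", which would explain why a parenthesis in the request path changes how the record is matched.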

To be honest, I'm not sure the character class that currently gets generated does what was intended: inside "[...]" the characters "(", "|" and ")" are taken literally, so the class also excludes parentheses and pipes, which at minimum may have unintended side effects. To implement this properly, the pattern builder would need access to two different representations of the QUOTE_CHAR parameter (as a grouped expression and as a character class), whereas currently there is only one.
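
As a rough sketch of what "two representations" could look like: the code below derives both the grouped expression and the negated character class from one list of quote characters. QuotePatternBuilder and all of its method names are hypothetical and do not exist in Hive, and the escaping rule is my assumption, not what TCTLSeparatedProtocol actually does.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; QuotePatternBuilder does not exist in Hive.
class QuotePatternBuilder {
    // Characters escaped with a backslash (an assumption made for this sketch).
    private static final String SPECIAL = "\\[]^-&(){}|.*+?$";

    static String escape(char c) {
        return SPECIAL.indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    // Quote characters as an alternation group, e.g. ("|\[|\])
    static String asGroup(char[] quotes) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < quotes.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(escape(quotes[i]));
        }
        return sb.append(')').toString();
    }

    // The same characters as a negated character class, e.g. [^"\[\]]
    static String asNegatedClass(char[] quotes) {
        StringBuilder sb = new StringBuilder("[^");
        for (char q : quotes) sb.append(escape(q));
        return sb.append(']').toString();
    }

    // Field pattern with the same shape as the one quoted in this issue,
    // using the character-class form inside the quoted section.
    static Pattern fieldPattern(char[] quotes) {
        String q = asGroup(quotes);
        String notQ = asNegatedClass(quotes);
        return Pattern.compile(
            "(?:^| )(" + q + "(?:" + notQ + "+|" + q + q + ")*" + q + "|[^ ]*)");
    }

    public static void main(String[] args) {
        System.out.println(fieldPattern(new char[] {'"', '[', ']'}).pattern());
    }
}
```

With the quote characters ", [ and ] this prints the pattern from the test above, except that the inner class is "[^"\[\]]" rather than the group wrapped in brackets.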

(You probably need to apply HIVE-520 first to make the TestTCTLSeparatedProtocol unit test run.)




[jira] Commented: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734986#action_12734986 ] 

Zheng Shao commented on HIVE-519:
---------------------------------

By the way, we plan to deprecate DynamicSerDe (with all protocols) at some point, because a lot of protocols (like TCTLSeparatedProtocol) are not real Thrift protocols and it's overly complicated to fit these serialization methods into Thrift. We will still support the "ThriftDeserializer", which supports statically generated Java classes.

What do you guys think? Are there any dependencies on DynamicSerDe on your end?




[jira] Updated: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HIVE-519:
---------------------------------

    Attachment: hive-stdout

Attaching a thread dump from one of the map tasks.

Example query that triggers this behaviour:
select remote_ip, count(1) from weblogs where insertdate='2009-05-26' group by remote_ip limit 10;



[jira] Issue Comment Edited: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713824#action_12713824 ] 

Zheng Shao edited comment on HIVE-519 at 5/27/09 6:13 PM:
----------------------------------------------------------

For debugging, please try
{code:regex}
(?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)
{code}
and
your line of log
on
http://www.fileformat.info/tool/regex.htm

      was (Author: zshao):
    For debugging, please try
(?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)
and
your line of log
on
http://www.fileformat.info/tool/regex.htm
  


[jira] Issue Comment Edited: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713824#action_12713824 ] 

Zheng Shao edited comment on HIVE-519 at 5/27/09 6:13 PM:
----------------------------------------------------------

For debugging, please try
{code}
(?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)
{code}
and
your line of log
on
http://www.fileformat.info/tool/regex.htm

      was (Author: zshao):
    For debugging, please try
{code:regex}
(?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)
{code}
and
your line of log
on
http://www.fileformat.info/tool/regex.htm
  


[jira] Resolved: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao resolved HIVE-519.
-----------------------------

    Resolution: Fixed

Fixed as part of HIVE-167. Please see HIVE-662 for the example query for Apache web log files.



[jira] Commented: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713947#action_12713947 ] 

Johan Oskarsson commented on HIVE-519:
--------------------------------------

127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"

This is an example of a line that takes 15 seconds to process on my machine, although it does eventually complete. If I remove the "(Main)" part of the URL, it takes 29 milliseconds to finish.
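
For anyone who wants to try this outside Hive, here is a small stand-alone harness. It is only a sketch: the class LogLineTiming and its method are made up for illustration, and it uses the field pattern quoted in this issue with the hand-corrected character class "[^"\[\]]" rather than the generated one. It splits the line above into fields and reports how long that took.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative harness; LogLineTiming is not part of Hive.
class LogLineTiming {
    // Field pattern from this issue, with the hand-corrected class [^"\[\]]
    static final Pattern CORRECTED = Pattern.compile(
        "(?:^| )((\"|\\[|\\])(?:[^\"\\[\\]]+|(\"|\\[|\\])(\"|\\[|\\]))*(\"|\\[|\\])|[^ ]*)");

    // The log line reported above.
    static final String LINE = "127.0.0.1 - - [26/May/2009:00:00:00 +0000] "
        + "\"GET /someurl/?track=Blabla(Main) HTTP/1.1\" 200 5864 - "
        + "\"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 "
        + "(KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19\"";

    // Collect group 1 (the field, including any surrounding quotes) of each match.
    static List<String> fields(Pattern p, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> f = fields(CORRECTED, LINE);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(f.size() + " fields in " + ms + " ms");
        for (String s : f) {
            System.out.println("  " + s);
        }
    }
}
```

Swapping the generated character class back into the pattern should reproduce the slow case, though the exact timings will depend on the machine and the JVM version.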



[jira] Assigned: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-519:
-------------------------------

    Assignee: Zheng Shao



[jira] Commented: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713820#action_12713820 ] 

Zheng Shao commented on HIVE-519:
---------------------------------

Can you upload the line of data that caused the trouble?



[jira] Commented: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713824#action_12713824 ] 

Zheng Shao commented on HIVE-519:
---------------------------------

For debugging, please try
(?:^| )(("|\[|\])(?:[^("|\[|\])]+|("|\[|\])("|\[|\]))*("|\[|\])|[^ ]*)
and your line of log on
http://www.fileformat.info/tool/regex.htm



[jira] Commented: (HIVE-519) Regex processing gets stuck when querying weblogs

Posted by "Johan Oskarsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735595#action_12735595 ] 

Johan Oskarsson commented on HIVE-519:
--------------------------------------

We don't depend on DynamicSerDe, so we don't mind if it is deprecated.
