
Posted to dev@nutch.apache.org by "Paul Baclace (JIRA)" <ji...@apache.org> on 2005/12/27 04:29:30 UTC

[jira] Created: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
---------------------------------------------------------------------------------------------------------

         Key: NUTCH-153
         URL: http://issues.apache.org/jira/browse/NUTCH-153
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.8-dev    
 Environment: all
    Reporter: Paul Baclace


If TextParser is given PostScript, it can take hours and then fail.  This can be avoided with careful configuration, but if the server MIME type is wrong and the basename of the URL has no "file extension", then this parser will take a long time and fail every time.

Analysis: The real problem is OutlinkExtractor.java, as reported in bug NUTCH-150, but that patch cannot entirely address the problem, since the first call to the regular expression match() can take a long time despite quantifier limits.

Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of the file.

Experience has shown that, as a fail-safe, it is worth protecting against GIGO directly in TextParser for this case, even though the suggested fix is not a general solution.  (A general solution would be a timeout on match().)
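
The suggested check could look roughly like the sketch below. This is not the attached patch; PostScriptSniffer and looksLikePostScript are hypothetical names, and the sketch assumes the raw content is available as a byte array. It searches the first 40 bytes rather than testing only offset 0, so a header preceded by a few spaces is still caught.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not the attached patch): spots a PostScript header near the start of content. */
final class PostScriptSniffer {
    private static final String PS_MAGIC = "%!PS-Adobe";
    private static final int SNIFF_LENGTH = 40; // "first 40 characters" from the suggested fix

    private PostScriptSniffer() {}

    /** Returns true if the magic string appears entirely within the first 40 bytes. */
    static boolean looksLikePostScript(byte[] content) {
        int n = Math.min(content.length, SNIFF_LENGTH);
        // ISO-8859-1 maps every byte to exactly one char, so binary input cannot fail to decode.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MAGIC);
    }
}
```

A parser could call looksLikePostScript() before doing any other work and report a failed parse for that one file when it returns true.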



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "Paul Baclace (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-153?page=all ]

Paul Baclace updated NUTCH-153:
-------------------------------

    Attachment: TextParser.java.patch

A patch to reject files with "%!PS-Adobe" in the first 40 characters of the file.



[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ] 

Doug Cutting commented on NUTCH-153:
------------------------------------

Paul,

Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too?  I.e., is at least part of the problem that ORO has some slow cases that Java's built-in regexes do not?



[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361997 ] 

KuroSaka TeruHiko commented on NUTCH-153:
-----------------------------------------

Actually, shouldn't turning on the mime.type.magic property do the job that the patch is trying to address?




[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361887 ] 

Jerome Charron commented on NUTCH-153:
--------------------------------------

What you call the white list is in fact the actual parse-plugins.xml file:
If you remove the default mapping, then unknown file types will not be parsed.
Moreover, in a previous discussion, Andrzej spoke about committing a parser, similar to the strings command-line tool, that is able to extract textual content from any file type. It is a really good candidate for the default parser, instead of the text parser.




[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "Paul Baclace (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ] 

Paul Baclace commented on NUTCH-153:
------------------------------------


> mime.type.magic?

The particular run that had problems was using mime.type.magic=true.  It turns out that the magic "%!PS-Adobe" was preceded by some spaces, so it was not recognized.

The intent of this bug is that no matter why some content is passed to TextParser, there should not be parasitic cases that take too long to process.  (Parsing one file for hours is equivalent to being fatal.) There are per-file space limits on parsing (first N bytes), but the only time limit is at the Task level (an hour of inactivity) and it is fatal on the third (default) attempt. 

It makes sense to have non-fatal per-file time limits on parsers when regular expressions (OutlinkExtractor) are used since some regexprs are prone to having parasitic cases that take a long time instead of blowing up a stack.

> strings command line like parser [filter]

This is a related and good idea, but a different beast.  The idea is to improve recall by grabbing marginal shreds of tokens out of files with unknown formats.  For this to be effective and not annoying, it needs a threshold (a minimal percentage of content found, or a minimal density) that a file must meet before any of its tokens are accepted, in order to reject binary files that just happen to hit upon reasonable strings.

(Reasonableness depends on charset/language, as pointed out by KuroSaka TeruHiko, but minimal ASCII, a.k.a. romaji, would be the most effective worldwide.)

It also should have a way to set the weight of the tokens found that would take into account the density of reasonable tokens.  That is, a similarly sized f.txt would rank higher than a mystery-format f.huh with the same number of token matches plus 70% binary.
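
A minimal sketch of such a density threshold, assuming "reasonable tokens" are approximated as runs of printable ASCII bytes (as the strings command does). TokenDensity, printableDensity and acceptable are hypothetical names, and the run length and threshold values are left to the caller because the thread does not propose any.

```java
/** Hypothetical sketch of a density threshold for a strings-like extractor. */
final class TokenDensity {
    private TokenDensity() {}

    /** Fraction of all bytes that sit in runs of at least minRun printable ASCII bytes. */
    static double printableDensity(byte[] content, int minRun) {
        if (content.length == 0) {
            return 0.0;
        }
        int counted = 0; // bytes in runs that were long enough
        int run = 0;     // length of the current printable run
        for (byte b : content) {
            if (b >= 0x20 && b <= 0x7e) { // printable ASCII; negative (high) bytes fail this test
                run++;
            } else {
                if (run >= minRun) {
                    counted += run;
                }
                run = 0;
            }
        }
        if (run >= minRun) {
            counted += run; // a run that reaches the end of the content
        }
        return (double) counted / content.length;
    }

    /** Accept tokens from a file only if its density reaches the threshold. */
    static boolean acceptable(byte[] content, int minRun, double minDensity) {
        return printableDensity(content, minRun) >= minDensity;
    }
}
```

The same density value could also scale the weight given to the tokens, so that a file that is 70% binary ranks below a plain text file with the same matches.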




[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361995 ] 

KuroSaka TeruHiko commented on NUTCH-153:
-----------------------------------------

The strings command would work with mostly ASCII text content.  It is highly doubtful whether we can have a universal strings command that works with any charset or language.

The right thing here, it seems to me, is to make the text parser either finish its job or fail when it is given non-text data.  Why is it taking hours, do we know?



[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "byron miller (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361347 ] 

byron miller commented on NUTCH-153:
------------------------------------

I was thinking: with the hundreds of file types out there, and endless new ones possible in the future, shouldn't we create a whitelist of supported file types and MIME types instead of a blacklist of what to block or ignore, as we do in the regex filters?


[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Posted by "Paul Baclace (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ] 

Paul Baclace commented on NUTCH-153:
------------------------------------

> NUTCH-160?

There is slowness and then there is continental drift.  The quantifiers should be used with any regex package unless the quantifier itself is a significant cost during match().  

The general solution is non-fatal per-file time limits on parsers, at least when regular expressions (OutlinkExtractor) are used.  That is, spawn a daemon thread as an alarm to interrupt() the thread doing match().  
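
A minimal sketch of the alarm-thread idea, assuming java.util.regex is the regex package. java.util.regex does not check for interruption on its own, so the sketch wraps the input in a CharSequence whose charAt() tests the interrupt flag; that wrapper is a workaround chosen for this sketch, not something proposed in this issue. TimedRegex, InterruptibleCharSequence, MatchTimeoutException and findWithTimeout are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex find() with a daemon alarm thread that interrupts the matcher. */
final class TimedRegex {
    private TimedRegex() {}

    /** Thrown from charAt() once the matching thread has been interrupted. */
    static final class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException() { super("regex match interrupted"); }
    }

    /** Wrapper that checks the interrupt flag on every character access. */
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        @Override public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new MatchTimeoutException();
            }
            return inner.charAt(index);
        }
        @Override public int length() { return inner.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }
        @Override public String toString() { return inner.toString(); }
    }

    /** Returns the result of find(), or null if the time limit was hit (non-fatal). */
    static Boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        final Thread matcherThread = Thread.currentThread();
        Thread alarm = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
                matcherThread.interrupt(); // time is up: interrupt the thread doing the match
            } catch (InterruptedException e) {
                // the match finished first and cancelled the alarm
            }
        });
        alarm.setDaemon(true);
        alarm.start();
        try {
            Matcher m = pattern.matcher(new InterruptibleCharSequence(input));
            return m.find();
        } catch (MatchTimeoutException e) {
            return null;
        } finally {
            alarm.interrupt(); // cancel the alarm if it is still sleeping
            boolean joined = false;
            while (!joined) {
                try {
                    alarm.join();
                    joined = true;
                } catch (InterruptedException e) {
                    // the alarm fired at the last moment; wait for it to exit
                }
            }
            // Clear the flag the alarm may have set. Caveat: this also swallows
            // an interrupt that came from elsewhere during the match.
            Thread.interrupted();
        }
    }
}
```

A null result means "gave up on this input", which the caller can treat as no outlinks found rather than as a task failure.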

I could make a match() timeout patch, but I have also seen a case where tagsoup spent a huge amount of time parsing files of type text/vnd.viewcvs-markup; I don't know what causes the problem, but this MIME type must be high in tortuosity since Chandler's mime-torture tests include many examples.  Thus, a general solution of non-fatal per-file time limits on parsing files would be better placed to take care of present and future problems of this type.
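
A per-file limit that is independent of the parser could be sketched with java.util.concurrent, as below. TimeLimitedParse and runWithLimit are hypothetical names and the parse call is represented by a plain Callable; nothing here is existing Nutch code. Note the limitation: Java cannot force a thread to stop, so a parser that never checks for interruption keeps running in its daemon thread after the caller has moved on.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: run one file's parse with a non-fatal time limit. */
final class TimeLimitedParse {
    private TimeLimitedParse() {}

    /** Returns the parse result, or empty if the parse timed out or threw. */
    static <T> Optional<T> runWithLimit(Callable<T> parseTask, long timeoutMillis) {
        // Daemon worker so an abandoned parse cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; helps only if the parser is interruptible
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the parser threw: a failed parse of this one file
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

For example, runWithLimit(() -> parser.parse(content), 30000) would give up on a single file after 30 seconds, where parser.parse stands in for whatever parse call is being protected.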


