Posted to dev@nutch.apache.org by "Alexis (JIRA)" <ji...@apache.org> on 2011/02/08 18:38:57 UTC

[jira] Created: (NUTCH-965) Parsing takes up 100% CPU

Parsing takes up 100% CPU
-------------------------

                 Key: NUTCH-965
                 URL: https://issues.apache.org/jira/browse/NUTCH-965
             Project: Nutch
          Issue Type: Improvement
          Components: parser
            Reporter: Alexis


The issue you're likely to run into when parsing truncated FLV files is described here:
http://www.mail-archive.com/user@nutch.apache.org/msg01880.html

The parser library gets stuck in an infinite loop when it encounters corrupted data, for example when big binary files are truncated at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214443#comment-13214443 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

OK, this should be done now. I cross-checked both branches and tried to keep the implementations as similar as possible.

(Please note that the v3 patches are now incorrect, obviously)
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215655#comment-13215655 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in nutch-trunk-maven #170 (See [https://builds.apache.org/job/nutch-trunk-maven/170/])
    RECOMMIT NUTCH-965 Skip parsing for truncated document (Revision 1293278)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215650#comment-13215650 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Recommitted. Thanks all.
                

        

[jira] Updated: (NUTCH-965) Parsing takes up 100% CPU

Posted by "Alexis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-965:
-------------------------

    Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the content buffer to see whether they match.

If this HTTP header is available and the file was truncated, skip the parsing step so that the parser does not get stuck in an infinite loop that takes up all the CPU resources.


Before, in the logs, we would see:

{noformat}2011-02-07 14:03:34,693 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat} 

After:

{noformat}2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat} 
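As a rough illustration of the check described above, here is a minimal, language-neutral sketch in Python. It is not the code from parserJob.patch (Nutch itself is written in Java); the function names declared_length, is_truncated and skip_message are hypothetical, and treating only a buffer that is shorter than the declared length as truncated is an assumption about what "match" should mean. The message text mirrors the "After" log lines.

```python
from typing import Mapping, Optional


def declared_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Content-Length value as an int, or None if absent or malformed."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-length":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def is_truncated(headers: Mapping[str, str], content: bytes) -> bool:
    """True only when a usable Content-Length exists and exceeds the bytes fetched."""
    expected = declared_length(headers)
    if expected is None:
        # Without the header we cannot tell, so the document is not skipped.
        return False
    return expected > len(content)


def skip_message(url: str, headers: Mapping[str, str], content: bytes) -> Optional[str]:
    """Build a log line in the style shown above, or None if parsing should go ahead."""
    if not is_truncated(headers, content):
        return None
    expected = declared_length(headers)
    return f"{url} skipped. Content of size {expected} was truncated to {len(content)}"


if __name__ == "__main__":
    # Sizes taken from the first "After" log line; the body is dummy data.
    body = b"\x00" * 63980
    print(skip_message("http://downtownjoes.com/botb1.flv",
                       {"Content-Length": "4527822"}, body))
```

When the header is missing or unparsable the sketch falls through to normal parsing, which matches the "if this HTTP header is available" condition above.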





        

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-965:
--------------------------------

    Fix Version/s: 2.0
                   1.4


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216293#comment-13216293 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in Nutch-nutchgora #174 (See [https://builds.apache.org/job/Nutch-nutchgora/174/])
    RECOMMIT NUTCH-965 Skip parsing for truncated document (Revision 1293277)
REVERT NUTCH-965 Skip parsing for truncated document (Revision 1293228)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/nutch-default.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java

ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/nutch-default.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java

                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215583#comment-13215583 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

OK, that's it: I have reverted the changes completely. I am not sure exactly what the cause is in your case, but I will give you the benefit of the doubt. Trunk and nutchgora are back to their previous states. Sorry for the inconvenience.

Lewis, could you reopen this issue? My mind is on some other matters right now, but I will come back to this one later.
                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117668#comment-13117668 ] 

Lewis John McGibbney commented on NUTCH-965:
--------------------------------------------

This would be great to get into 1.4. Do you have time to get this in, Alexis? If not, I am willing to try to get it working.
                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215530#comment-13215530 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Hi Markus,

For trunk I ran the following test crawls, and they all worked as expected (for URLs that are NOT truncated):
- fetching and separate parsing (parser.skip.truncated set to true)
- fetching with parsing (parser.skip.truncated set to true)
- fetching and separate parsing (parser.skip.truncated set to false)
- fetching with parsing (parser.skip.truncated set to false)

I did the same for nutchgora. This verifies that for non-truncated URLs everything works as before.

For URLs that _are_ truncated, I debugged a crawl and artificially changed the size to check that parsing is skipped, but only when parser.skip.truncated is set to true. This works too.

In short, yes it has been fixed.
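
The behaviour verified in this comment can be summarised as a small decision table. The sketch below is written in Python for brevity and is only an illustration, not the committed Nutch code: should_parse is a hypothetical helper, and the only name taken from the thread is the parser.skip.truncated property.

```python
def should_parse(skip_truncated: bool, is_truncated: bool) -> bool:
    """Parse unless the skip flag is on AND the document is truncated."""
    return not (skip_truncated and is_truncated)


if __name__ == "__main__":
    # Enumerate the combinations of the flag and the document state.
    for skip in (True, False):
        for truncated in (True, False):
            print(
                f"parser.skip.truncated={skip!s:5} "
                f"truncated={truncated!s:5} "
                f"-> parse={should_parse(skip, truncated)}"
            )
```

Only one of the four combinations suppresses parsing, which is why crawls of non-truncated URLs behave the same with either setting.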

                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216298#comment-13216298 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in Nutch-trunk #1768 (See [https://builds.apache.org/job/Nutch-trunk/1768/])
    RECOMMIT NUTCH-965 Skip parsing for truncated document (Revision 1293278)
REVERT NUTCH-965 Skip parsing for truncated document (Revision 1293225)

     Result = SUCCESS
ferdy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293278
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

ferdy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293225
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

                

        

[jira] [Closed] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed NUTCH-965.
--------------------------------------

    

        

[jira] Updated: (NUTCH-965) Skip parsing for truncated documents

Posted by "Alexis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-965:
-------------------------

    Summary: Skip parsing for truncated documents  (was: Parsing takes up 100% CPU)


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214595#comment-13214595 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Please note that the failure in Nutch-nutchgora #171 is unrelated. (It's the cursed TestAPI)
                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214282#comment-13214282 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in Nutch-nutchgora #170 (See [https://builds.apache.org/job/Nutch-nutchgora/170/])
    NUTCH-965 Skip parsing for truncated documents (Revision 1292184)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/nutch-default.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java

                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215366#comment-13215366 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in Nutch-trunk #1767 (See [https://builds.apache.org/job/Nutch-trunk/1767/])
    integrate NUTCH-965 Skip parsing for truncated documents (commit 3) (Revision 1292686)
integrate NUTCH-965 Skip parsing for truncated documents (commit 2) (Revision 1292667)

     Result = SUCCESS
ferdy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292686
Files : 
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

ferdy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292667
Files : 
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215596#comment-13215596 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in nutch-trunk-maven #169 (See [https://builds.apache.org/job/nutch-trunk-maven/169/])
    REVERT NUTCH-965 Skip parsing for truncated document (Revision 1293225)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215507#comment-13215507 ] 

Markus Jelsma commented on NUTCH-965:
-------------------------------------

Has this been fixed now?
                

        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190545#comment-13190545 ] 

Lewis John McGibbney commented on NUTCH-965:
--------------------------------------------

Hi, can anyone advise whether I should be looking at the ParseUtil class in trunk? I'm a bit confused and Eclipse doesn't seem to be helping much.
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-965:
-------------------------------

    Attachment: NUTCH-965-v3-trunk.txt
                NUTCH-965-v3-nutchgora.txt
    
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] Commented: (NUTCH-965) Parsing takes up 100% CPU

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992440#comment-12992440 ] 

Julien Nioche commented on NUTCH-965:
-------------------------------------

This should be optional but activated by default.
Parsing is also done within the fetcher, so it would need modifying there as well.
It would be nice to have this in 1.3.
Note: changing the title to something like "Skip parsing for truncated documents" would give a more accurate description.
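
A minimal sketch of "optional but activated by default", assuming the switch is exposed as a boolean configuration property. The property name and the TruncationConfig class are hypothetical illustrations, not taken from any patch on this issue:

```java
import java.util.Properties;

// Hypothetical sketch: the property name is an assumption made for illustration.
final class TruncationConfig {
  static final String SKIP_TRUNCATED_KEY = "parser.skip.truncated";

  // Returns true when the property is unset or set to "true" (case-insensitive).
  // Any other value switches the truncation check off.
  static boolean skipTruncated(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(SKIP_TRUNCATED_KEY, "true"));
  }
}
```

With this shape, a crawl that never mentions the property gets the protective behaviour, and an operator who wants truncated documents parsed anyway sets the value to false.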

> Parsing takes up 100% CPU
> -------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>         Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213471#comment-13213471 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in nutch-trunk-maven #162 (See [https://builds.apache.org/job/nutch-trunk-maven/162/])
    NUTCH-965 Skip parsing for truncated documents (Revision 1292185)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215586#comment-13215586 ] 

Markus Jelsma commented on NUTCH-965:
-------------------------------------

I couldn't find an issue in your code either, so I finally assumed it had to be my build messing things up. It works as expected now.

                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214287#comment-13214287 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in Nutch-trunk #1766 (See [https://builds.apache.org/job/Nutch-trunk/1766/])
    NUTCH-965 Skip parsing for truncated documents (Revision 1292185)

     Result = FAILURE
ferdy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292185
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211875#comment-13211875 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Hi Lewis,

FYI: I'm currently looking into this one, for both nutchgora and trunk. I picked up your patch and changed a few things here and there to make it work, including Julien's remarks. Side note: your patch was generated with a hardcoded project, namely on line 2 in the patch file "#P nutchgora". I couldn't apply it to a project that is named differently without removing this line. (Not too much of a worry, but just to let you know; I hadn't seen anything like that before.)

Will get back on this.
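
A small sketch of the workaround described above, assuming the project marker always appears as a line beginning with "#P ". The PatchCleaner class is a hypothetical helper, not part of any attachment on this issue:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: removes the project marker line from a patch
// so that it can be applied to a project with a different name.
final class PatchCleaner {
  // Drops every line that begins with "#P " and keeps all other lines in order.
  static List<String> stripProjectMarker(List<String> patchLines) {
    return patchLines.stream()
        .filter(line -> !line.startsWith("#P "))
        .collect(Collectors.toList());
  }
}
```

Reading the patch file into a list of lines and writing the filtered list back is left out here; generating the patch without the project marker in the first place avoids the step entirely.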
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215585#comment-13215585 ] 

Markus Jelsma commented on NUTCH-965:
-------------------------------------

Hmm, cleaning and rebuilding the job fixes that issue here. Please ignore :)
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214566#comment-13214566 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in Nutch-nutchgora #171 (See [https://builds.apache.org/job/Nutch-nutchgora/171/])
    integrate NUTCH-965 Skip parsing for truncated documents (commit 3) (Revision 1292684)
integrate NUTCH-965 Skip parsing for truncated documents (commit 2) (Revision 1292679)

     Result = FAILURE
ferdy : 
Files : 
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java

ferdy : 
Files : 
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java

                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-965:
---------------------------------------

    Attachment: NUTCH-965-v2.patch

Hi Guys,

I would ask you to comment, as this patch is not finished yet. Although I've made the functionality a boolean configurable, I've also intentionally neglected to address the second of your points, Julien, regarding FetcherJob.java.

I see that the boolean parsing value is set in this class [1], but would like you to confirm if the code I'm writing should live under the public Collection object on line 138.

Once this is addressed it would be great to get a patch for trunk.

Thanks to anyone who can comment on this.

[1] http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java?view=markup
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-965:
--------------------------------

    Patch Info: [Patch Available]

> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>             Fix For: 1.4, 2.0
>
>         Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083064#comment-13083064 ] 

Markus Jelsma commented on NUTCH-965:
-------------------------------------

Can you provide a patch for 1.4 as well Alexis?

> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>             Fix For: 1.4, 2.0
>
>         Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Assigned] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-965:
------------------------------------------

    Assignee: Lewis John McGibbney
    
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215561#comment-13215561 ] 

Markus Jelsma commented on NUTCH-965:
-------------------------------------

Hi Ferdy,

With a parsing fetcher on trunk we see the ParseStatus.success counter rarely being incremented. A test crawl successfully fetches 10,000 records but the success counter hangs around 15 records. Most, if not all, fetched pages are well below the truncating threshold.

Cheers
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214598#comment-13214598 ] 

Lewis John McGibbney commented on NUTCH-965:
--------------------------------------------

Yeah, this is confirmed, Ferdy. I spun a build and you're right. Another headache to deal with :) Relentless!
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214390#comment-13214390 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

The test works now. The pretty obvious fix was the inversion of the "isTruncated(content)" check. (Not sure what went wrong yesterday, as I stated that I had verified the changes; I probably made a small modification afterwards, assuming that it could not break the code...)
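
To illustrate why the direction of that check matters, here is a minimal sketch. It assumes truncation is detected by comparing the number of bytes stored with the length declared in the Content-Length header; the committed isTruncated may differ, and the class and method signatures below are hypothetical:

```java
// Hypothetical sketch: the heuristic and signatures are assumptions for illustration.
final class TruncationCheck {
  // A document counts as truncated when fewer bytes were stored than the
  // server announced in its Content-Length header.
  static boolean isTruncated(byte[] content, String contentLengthHeader) {
    if (contentLengthHeader == null) {
      return false; // no declared length, nothing to compare against
    }
    long declared;
    try {
      declared = Long.parseLong(contentLengthHeader.trim());
    } catch (NumberFormatException e) {
      return false; // unparsable header, treat the document as complete
    }
    return content.length < declared;
  }

  // Parse only documents that are NOT truncated. Dropping the negation would
  // skip every complete document and parse only the truncated ones.
  static boolean shouldParse(byte[] content, String contentLengthHeader) {
    return !isTruncated(content, contentLengthHeader);
  }
}
```

A unit test that feeds in one complete and one truncated document and asserts on both outcomes would catch an inverted condition immediately.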
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215591#comment-13215591 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Ok, I will recommit it. Luckily I did notice a minor thing in nutchtrunk in Fetcher:

When parsing is skipped because of truncation, the following code was also skipped:

if (parseResult == null) {
  byte[] signature =
      SignatureFactory.getSignature(getConf()).calculate(content,
          new ParseStatus().getEmptyParse(conf));
  datum.setSignature(signature);
}
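
The point of the snippet is that a document still needs a signature when no parse result exists. A self-contained sketch of that fallback follows, assuming an MD5 digest of the raw bytes as a stand-in for whatever implementation SignatureFactory returns; the SignatureFallback class and its methods are hypothetical:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: MD5 over the raw bytes stands in for the configured signature.
final class SignatureFallback {
  // Computes a signature from the content alone, without any parse output.
  static byte[] contentSignature(byte[] content) {
    try {
      return MessageDigest.getInstance("MD5").digest(content);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 digest not available", e);
    }
  }

  // parseSignature is null when parsing was skipped or produced no result;
  // in that case fall back to the signature computed from the content alone.
  static byte[] signatureFor(byte[] content, byte[] parseSignature) {
    return parseSignature != null ? parseSignature : contentSignature(content);
  }
}
```

Keeping the fallback outside the branch that decides whether to parse means a skipped parse cannot also skip the signature.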


                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214415#comment-13214415 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Double-checked, and it seems I introduced a few other bugs.

Nutchgora: Parsing will not be performed at all when the 'checkTruncated' boolean is false.
Nutchtrunk: The checkTruncated flag is ignored in ParseSegment.

Sorry about that. I ought to expand my test coverage. Fixing them now.
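
Both bugs concern how the configuration flag combines with the truncation test. A sketch of the intended guard, with a hypothetical ParseGuard class and a shouldParse method that is not taken from the committed code:

```java
// Hypothetical sketch of the intended decision, not the committed implementation.
final class ParseGuard {
  // checkTruncated: whether the truncation check is enabled (configuration flag).
  // truncated: result of the truncation test for the current document.
  // Parsing is skipped only when the check is enabled AND the document is truncated.
  static boolean shouldParse(boolean checkTruncated, boolean truncated) {
    return !(checkTruncated && truncated);
  }

  public static void main(String[] args) {
    boolean[] values = {false, true};
    for (boolean check : values) {
      for (boolean truncated : values) {
        System.out.printf("checkTruncated=%b truncated=%b -> parse=%b%n",
            check, truncated, shouldParse(check, truncated));
      }
    }
  }
}
```

Of the four combinations, only an enabled check together with a truncated document suppresses parsing. With the flag off every document is parsed, which is the case the first bug broke; the flag taking part in the decision at all is what the second bug omitted.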
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214434#comment-13214434 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in nutch-trunk-maven #165 (See [https://builds.apache.org/job/nutch-trunk-maven/165/])
    integrate NUTCH-965 Skip parsing for truncated documents (commit 3) (Revision 1292686)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in infinite loop as it encounters corrupted data due to for example truncating big binary files at fetch time.


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213462#comment-13213462 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Tested, verified, and committed to both trunk and the branch, with parsing during fetch as well as with separate parsing.
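
For readers following the thread, the idea behind skipping truncated documents can be sketched roughly as follows. This is a minimal standalone illustration, not the committed patch: the class name TruncationCheck, the method isTruncated, and the use of a plain Map for the response headers are all hypothetical, and it assumes the server sent a header stored under the exact key "Content-Length".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: decides whether fetched content
// is shorter than the length the server declared.
class TruncationCheck {

    /**
     * Returns true when the declared Content-Length exceeds the number of
     * bytes actually fetched. A missing or unparsable header yields false,
     * because truncation cannot be determined in that case.
     */
    static boolean isTruncated(Map<String, String> headers, byte[] content) {
        String declared = headers.get("Content-Length");
        if (declared == null) {
            return false;
        }
        long expected;
        try {
            expected = Long.parseLong(declared.trim());
        } catch (NumberFormatException e) {
            return false;
        }
        return content.length < expected;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Length", "4096");
        byte[] fetched = new byte[1024]; // fetch stopped early, e.g. at a size limit

        if (isTruncated(headers, fetched)) {
            System.out.println("skipping parse: content is truncated");
        } else {
            System.out.println("parsing");
        }
    }
}
```

A caller would run such a check before handing the bytes to a parser, so that a corrupted binary never reaches the parsing library. Treating an absent header as "not truncated" is one possible design choice; a stricter policy could skip such documents as well.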
                


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214393#comment-13214393 ] 

Hudson commented on NUTCH-965:
------------------------------

Integrated in nutch-trunk-maven #164 (See [https://builds.apache.org/job/nutch-trunk-maven/164/])
    integrate NUTCH-965 Skip parsing for truncated documents (commit 2) (Revision 1292667)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

                


        

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-965:
--------------------------------

    Fix Version/s:     (was: 1.4)
                   1.5
    


        

[jira] [Resolved] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-965.
----------------------------------------

    Resolution: Fixed

The patch for Nutchgora is much more comprehensive and a far cleaner implementation. In all honesty, the patch for trunk shadows this. Thanks, Ferdy.
                


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135917#comment-13135917 ] 

Lewis John McGibbney commented on NUTCH-965:
--------------------------------------------

I'll address this for both versions after we release 1.4.
                


        

[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214383#comment-13214383 ] 

Ferdy Galema commented on NUTCH-965:
------------------------------------

Will fix the test right away.
                
