Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/08/02 12:34:16 UTC

[jira] Created: (NUTCH-869) Add back parse-html

Add back parse-html
-------------------

                 Key: NUTCH-869
                 URL: https://issues.apache.org/jira/browse/NUTCH-869
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 2.0, nutchbase
            Reporter: Andrzej Bialecki 
            Assignee: Andrzej Bialecki 


We need to add back parse-html. There are a few serious problems with HTML parsing in Tika 0.7, so it's not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so if a future version of Tika > 0.7 has a chance of passing our tests we can again remove this plugin and use parse-tika alone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-869) Add back parse-html

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-869:
--------------------------------

        Fix Version/s: 1.2
                       2.0
                       nutchbase
    Affects Version/s: 1.2

> Add back parse-html
> -------------------
>
>                 Key: NUTCH-869
>                 URL: https://issues.apache.org/jira/browse/NUTCH-869
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2, 2.0, nutchbase
>            Reporter: Andrzej Bialecki 
>            Assignee: Julien Nioche
>             Fix For: 1.2, 2.0, nutchbase
>
>
> We need to add back parse-html. There are a few serious problems with HTML parsing in Tika 0.7, so it's not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so if a future version of Tika > 0.7 has a chance of passing our tests we can again remove this plugin and use parse-tika alone.



[jira] Assigned: (NUTCH-869) Add back parse-html

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-869:
-----------------------------------

    Assignee: Julien Nioche  (was: Andrzej Bialecki )

> Add back parse-html
> -------------------
>
>                 Key: NUTCH-869
>                 URL: https://issues.apache.org/jira/browse/NUTCH-869
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2, 2.0, nutchbase
>            Reporter: Andrzej Bialecki 
>            Assignee: Julien Nioche
>             Fix For: 1.2, 2.0, nutchbase
>
>
> We need to add back parse-html. There are a few serious problems with HTML parsing in Tika 0.7, so it's not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so if a future version of Tika > 0.7 has a chance of passing our tests we can again remove this plugin and use parse-tika alone.



[jira] Resolved: (NUTCH-869) Add back parse-html

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-869.
---------------------------------

    Resolution: Fixed

Nutchbase : Committed revision 982184
1.2 : Committed revision 982185
trunk (2.0) : Committed revision 982197

> Add back parse-html
> -------------------
>
>                 Key: NUTCH-869
>                 URL: https://issues.apache.org/jira/browse/NUTCH-869
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2, 2.0, nutchbase
>            Reporter: Andrzej Bialecki 
>            Assignee: Julien Nioche
>             Fix For: 1.2, 2.0, nutchbase
>
>
> We need to add back parse-html. There are a few serious problems with HTML parsing in Tika 0.7, so it's not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so if a future version of Tika > 0.7 has a chance of passing our tests we can again remove this plugin and use parse-tika alone.



[jira] Closed: (NUTCH-869) Add back parse-html

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-869.
-------------------------------


> Add back parse-html
> -------------------
>
>                 Key: NUTCH-869
>                 URL: https://issues.apache.org/jira/browse/NUTCH-869
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2, 2.0, nutchbase
>            Reporter: Andrzej Bialecki 
>            Assignee: Julien Nioche
>             Fix For: 1.2, 2.0, nutchbase
>
>
> We need to add back parse-html. There are a few serious problems with HTML parsing in Tika 0.7, so it's not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so if a future version of Tika > 0.7 has a chance of passing our tests we can again remove this plugin and use parse-tika alone.



[jira] Commented: (NUTCH-869) Add back parse-html

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894525#action_12894525 ] 

Julien Nioche commented on NUTCH-869:
-------------------------------------

+1

> Add back parse-html
> -------------------
>
>                 Key: NUTCH-869
>                 URL: https://issues.apache.org/jira/browse/NUTCH-869
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, nutchbase
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>
> We need to add back parse-html. There are a few serious problems with HTML parsing in Tika 0.7, so it's not possible to do a quality crawl using parse-tika alone. The necessary improvements to Tika are on the way, so if a future version of Tika > 0.7 has a chance of passing our tests we can again remove this plugin and use parse-tika alone.
