Posted to dev@nutch.apache.org by "Marcin Okraszewski (JIRA)" <ji...@apache.org> on 2007/05/22 14:18:16 UTC

[jira] Created: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

Extension point with filters for Neko HTML parser (with patch)
--------------------------------------------------------------

                 Key: NUTCH-490
                 URL: https://issues.apache.org/jira/browse/NUTCH-490
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: Any
            Reporter: Marcin Okraszewski
            Priority: Minor
         Attachments: HtmlParser.java.diff

In my project I need to set filters for the Neko HTML parser. Instead of hard-coding them, I made an extension point for defining Neko filters. I followed the code that handles HtmlParser filters. In fact, I think the method that gets the filters could be generalized to handle both cases, but I didn't want to make too big a mess.
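
A minimal sketch of the extension-point idea, separate from the attached patch. DocumentFilter and NekoFilterExtension are hypothetical names invented for illustration; in the real parser the filters would be Neko's XMLDocumentFilter instances, and the extensions would presumably be looked up through the Nutch plugin repository rather than passed in as a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NekoFilterSketch {

    /** Hypothetical stand-in for a Neko document filter; here it just rewrites text. */
    public interface DocumentFilter {
        String apply(String html);
    }

    /** Hypothetical extension-point interface: each plugin contributes one filter. */
    public interface NekoFilterExtension {
        DocumentFilter createFilter();
    }

    /** Collects one filter from every registered extension, in registration order. */
    public static DocumentFilter[] collectFilters(List<NekoFilterExtension> extensions) {
        List<DocumentFilter> filters = new ArrayList<>();
        for (NekoFilterExtension extension : extensions) {
            filters.add(extension.createFilter());
        }
        return filters.toArray(new DocumentFilter[0]);
    }

    /** Passes the input through each filter in turn, the way a filter chain would. */
    public static String runPipeline(String input, DocumentFilter[] filters) {
        String current = input;
        for (DocumentFilter filter : filters) {
            current = filter.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Assumption: two plugins, each registering one filter.
        List<NekoFilterExtension> extensions = new ArrayList<>();
        extensions.add(() -> html -> html.trim());
        extensions.add(() -> html -> html.toLowerCase(Locale.ROOT));

        DocumentFilter[] filters = collectFilters(extensions);
        System.out.println(runPipeline("  <P>Hello</P>  ", filters)); // prints <p>hello</p>
    }
}
```

The point of the design is that the parser code only sees an array of filters; which filters exist is decided by whatever plugins are installed, not by HtmlParser itself.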

The attached patch is for Nutch 0.9. This part of the code hasn't changed in trunk, so the patch should apply easily.

BTW, I wonder whether it would be best to have HTML DOM parsing defined by an extension point itself. Currently there are options for Neko and TagSoup. But if someone wanted to use something else, or to give the parser different settings, they would need to modify the HtmlParser class instead of replacing a plugin.
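
The suggestion above could look roughly like the sketch below: a registry of DOM parser implementations selected by name. Everything here is an assumption made for illustration. HtmlDomParser, DomParserRegistrySketch, and the "custom" entry are hypothetical, and the parse method returns a plain string where a real implementation would return a DOM tree.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DomParserRegistrySketch {

    /** Hypothetical extension-point interface for an HTML DOM parser. */
    public interface HtmlDomParser {
        /** A real implementation would return a DOM tree; a string keeps the sketch small. */
        String parse(String html);
    }

    private final Map<String, HtmlDomParser> parsers = new LinkedHashMap<>();

    /** Called once per plugin that offers a DOM parser. */
    public void register(String name, HtmlDomParser parser) {
        parsers.put(name, parser);
    }

    /** Picks the parser named in the configuration, failing loudly if it is missing. */
    public HtmlDomParser select(String name) {
        HtmlDomParser parser = parsers.get(name);
        if (parser == null) {
            throw new IllegalArgumentException("No DOM parser registered as '" + name + "'");
        }
        return parser;
    }

    public static void main(String[] args) {
        DomParserRegistrySketch registry = new DomParserRegistrySketch();
        // Stub implementations; real ones would wrap Neko, TagSoup, or another library.
        registry.register("neko", html -> "neko parsed " + html.length() + " chars");
        registry.register("tagsoup", html -> "tagsoup parsed " + html.length() + " chars");
        registry.register("custom", html -> "custom parsed " + html.length() + " chars");

        String configured = "custom"; // assumption: value read from configuration
        System.out.println(registry.select(configured).parse("<p>hi</p>"));
    }
}
```

With this shape, swapping the parser or changing its settings means shipping a different plugin and changing one configuration value, with no edit to the HtmlParser class.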

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

Posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
-------------------------------------

    Attachment: HtmlParser.java.diff

Patch for HtmlParser.




[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

Posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
-------------------------------------

    Attachment: nutch-extensionpoins_plugin.xml.diff

Patch for plugin.xml in nutch-extensionpoints.

BTW, why are extension points declared in this plugin? Normally I would define this extension point in the plugin.xml of the parse-html plugin. But I saw that all extension points are defined here, so I followed this policy.




[jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

Posted by "Marcin Okraszewski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
-------------------------------------

    Attachment: NekoFilters_for_1.0.patch

Patch ported to Nutch 1.0. It includes the two previous patches.

