Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/08/08 21:08:21 UTC

[jira] Created: (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
--------------------------------------------------------------------------

                 Key: NUTCH-874
                 URL: https://issues.apache.org/jira/browse/NUTCH-874
             Project: Nutch
          Issue Type: Bug
          Components: parser
         Environment: Nutch 2.0
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
            Priority: Critical
             Fix For: 2.0


I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin to make sure they all work with Gora/Nutchbase now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897190#action_12897190 ] 

Julien Nioche commented on NUTCH-874:
-------------------------------------

{quote}
I think Jukka already worked on something really similar to the ExtParser in Tika. See: http://tika.apache.org/0.7/api/org/apache/tika/parser/ExternalParser.html
{quote}
Yes, that's the one I had in mind.

One of the plugins that hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, but there is a substantial difference: the Tika feed parser generates a simple XHTML representation of the document in which the feed entries are represented as anchors, whereas the Nutch version creates a new document for each feed entry.
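
To make that difference concrete, here is a minimal Java sketch of the two output shapes. It does not use the Tika or Nutch APIs; the class name, the Entry type, and both method names are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FeedRepresentationSketch {

    /** Hypothetical stand-in for one entry of a feed. */
    static final class Entry {
        final String title;
        final String link;

        Entry(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Single-document shape: one XHTML page with one anchor per entry (no escaping, illustration only). */
    static String asSingleXhtml(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("<html><body><ul>");
        for (Entry e : entries) {
            sb.append("<li><a href=\"").append(e.link).append("\">")
              .append(e.title).append("</a></li>");
        }
        return sb.append("</ul></body></html>").toString();
    }

    /** Multi-document shape: one separate result per entry, keyed by the entry URL. */
    static Map<String, String> asSeparateDocuments(List<Entry> entries) {
        Map<String, String> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            docs.put(e.link, e.title);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("First post", "http://example.com/1"),
            new Entry("Second post", "http://example.com/2"));
        System.out.println(asSingleXhtml(entries));
        System.out.println(asSeparateDocuments(entries));
    }
}
```

The second shape is the one that needs a parse API able to return several results for a single fetched URL.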

There is also the parse-rss plugin in Nutch, which is quite similar. What's the difference from the feed one again? Since the Tika parser would handle all sorts of feed formats, why not simply rely on it?




[jira] Commented: (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896552#action_12896552 ] 

Chris A. Mattmann commented on NUTCH-874:
-----------------------------------------

Hey Julien,

I think Jukka already worked on something really similar to the ExtParser in Tika. See: http://tika.apache.org/0.7/api/org/apache/tika/parser/ExternalParser.html

If we go that route here in Nutch, I think we should add an encoding attribute similar to NUTCH-564 and flow it through in parse-tika. If we can do that, I think we're good!
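
As a rough illustration of what flowing an encoding attribute through could mean, the Java sketch below decodes the raw output of an external command with a configured charset and falls back to UTF-8 when the attribute is missing or unknown. This is an assumption about the intent, not the NUTCH-564 patch or the Tika ExternalParser API; the class and method names are hypothetical, and the UTF-8 fallback is a choice made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingAttributeSketch {

    /** Resolves a hypothetical "encoding" attribute value, falling back to UTF-8. */
    static Charset resolve(String encodingAttr) {
        if (encodingAttr == null || encodingAttr.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(encodingAttr.trim());
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.UTF_8;
        }
    }

    /** Decodes the bytes produced by an external command using the resolved charset. */
    static String decode(byte[] rawOutput, String encodingAttr) {
        return new String(rawOutput, resolve(encodingAttr));
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, "ISO-8859-1"));
    }
}
```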

Cheers,
Chris





[jira] Commented: (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896479#action_12896479 ] 

Julien Nioche commented on NUTCH-874:
-------------------------------------

Some plugins have not been ported to the new API because it does not provide multi-valued parse results. See http://search.lucidimagination.com/search/document/844c48289f2d07db/nutchbase_multi_value_parseresult_missing#4ed6f352ebcce8ef

This is probably not the case for the ExtParser though. We could rely on Tika's mechanism for external parsing instead of maintaining ours. WDYT?

