Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/08/14 10:31:16 UTC

[jira] Created: (NUTCH-887) Delegate parsing of feeds to Tika

Delegate parsing of feeds to Tika
---------------------------------

                 Key: NUTCH-887
                 URL: https://issues.apache.org/jira/browse/NUTCH-887
             Project: Nutch
          Issue Type: Wish
          Components: parser
    Affects Versions: 2.0
            Reporter: Julien Nioche
             Fix For: 2.0


[Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]

One of the plugins that hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, bearing in mind one substantial difference: the Tika feed parser generates a simple XHTML representation of the document in which the feed entries are simply represented as anchors, whereas the Nutch version created a new document for each feed entry.
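
To make that difference concrete, here is a minimal sketch of the two output shapes. It is not Tika or Nutch code: the function names (read_entries, as_single_xhtml, as_separate_documents) and the sample feed are hypothetical, and it only handles plain RSS 2.0 items using the Python standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def read_entries(rss_text):
    """Return a (title, link) pair for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def as_single_xhtml(rss_text):
    """Single-document shape: one XHTML page, each entry rendered as an anchor."""
    anchors = "".join(
        "<li><a href=%s>%s</a></li>" % (quoteattr(link), escape(title))
        for title, link in read_entries(rss_text)
    )
    return "<html><body><ul>%s</ul></body></html>" % anchors


def as_separate_documents(rss_text):
    """Per-entry shape: one output document per entry, keyed by the entry URL."""
    return {link: {"title": title} for title, link in read_entries(rss_text)}


if __name__ == "__main__":
    print(as_single_xhtml(SAMPLE_RSS))
    print(as_separate_documents(SAMPLE_RSS))
```

With the first shape, a crawler sees the entries only as outlinks of the feed page; with the second, each entry is a document in its own right.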

There is also the parse-rss plugin in Nutch which is quite similar - what's the difference with the feed one again? Since the Tika parser would handle all sorts of feed formats why not simply rely on it? 

Any thoughts on this?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898652#action_12898652 ] 

Chris A. Mattmann commented on NUTCH-887:
-----------------------------------------

bq. There is something missing in Tika, and it's the support for compound documents, but it's not likely to be added in 0.8

Huh, what do you mean? Nick just added a bunch of code to handle Compound document detection, and parsing, see TIKA-447 and the discussions on the wiki here: http://wiki.apache.org/tika/MetadataDiscussion. It may not be complete yet, but neither is 0.8. 

bq. I'd keep the "feed" plugin around for a while still, as an interim solution until Tika supports compound documents. +1 to getting rid of parse-rss.

+1, I agree, but I still believe our goal should be to delegate this to Tika. I'm starting to feel the creep of parsing plugins making their way back into Nutch instead of us just jumping over into Tika and working the process over there. In the end, if we start to add back all the parsing plugins, I'm not sure we've accomplished our goal...





[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898667#action_12898667 ] 

Andrzej Bialecki  commented on NUTCH-887:
-----------------------------------------

bq. Huh, what do you mean? Nick just added a bunch of code to handle Compound document detection, and parsing

Ah, good - I missed that, I need to take a closer look at this...

bq. I'm starting to feel the creep of parsing plugins make their way back into Nutch instead of just jumping over into Tika

The "creep" so far is just parse-html, which we were forced to add back because Tika HTML parsing was totally inadequate for our needs. I know there has been some progress on this front, but I suspect it's still not sufficient. The ultimate goal is still to use Tika for all formats that it can handle, preferably "all formats" without further qualifiers ;)



[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898647#action_12898647 ] 

Andrzej Bialecki  commented on NUTCH-887:
-----------------------------------------

bq. If there's something missing that Nutch needs, we'll add it to Tika and roll it into 0.8.

There is something missing in Tika, and it's the support for compound documents, but it's not likely to be added in 0.8... not that we have such support in Nutch at the moment - it fell victim to the trunk/nutchbase switch, but it should be added back soon. I'd keep the "feed" plugin around for a while still, as an interim solution until Tika supports compound documents. +1 to getting rid of parse-rss.



[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898706#action_12898706 ] 

Chris A. Mattmann commented on NUTCH-887:
-----------------------------------------

bq. Ah, good - I missed that, I need to take a closer look at this...

Np, let me know what you think. If it needs improvement, I'll be happy to pick up a shovel and help out.

bq. The "creep" so far is just parse-html, which we were forced to add back because Tika HTML parsing was totally inadequate for our needs. I know there has been some progress on this front, but I suspect it's still not sufficient. The ultimate goal is still to use Tika for all formats that it can handle, preferably "all formats" without further qualifiers ;)

Coo coo, thanks Andrzej!

Cheers,
Chris




[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898827#action_12898827 ] 

Julien Nioche commented on NUTCH-887:
-------------------------------------

Have created https://issues.apache.org/jira/browse/NUTCH-888 and will remove parse-rss tomorrow.





[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898620#action_12898620 ] 

Chris A. Mattmann commented on NUTCH-887:
-----------------------------------------

Hey Julien:

+1 to relying on Tika for RSS parsing. If there's something missing that Nutch needs, we'll add it to Tika and roll it into 0.8.

{quote}
There is also the parse-rss plugin in Nutch which is quite similar - what's the difference with the feed one again? Since the Tika parser would handle all sorts of feed formats why not simply rely on it? 
{quote}

I wrote parse-rss back in 2005, and used commons-feedparser from Kevin Burton and his crew. At the time it was well developed, and a little more flexible and easier for me to pick up than Rome. Since then, however, its development has stagnated and it is no longer maintained.

In terms of functionality, the two plugins are roughly equivalent. I would suggest we move forward with the feed plugin in Tika and roll it back in through Nutch.
