Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/11/04 14:59:54 UTC
[jira] Commented: (TIKA-466) Feed Parser
[ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928216#action_12928216 ]
Ken Krugler commented on TIKA-466:
----------------------------------
Hi Julien & Chris,
See TIKA-543 for an issue with the Rome 1.0 dependency.
Thanks,
-- Ken
> Feed Parser
> -----------
>
> Key: TIKA-466
> URL: https://issues.apache.org/jira/browse/TIKA-466
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Julien Nioche
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.8
>
> Attachments: TIKA-466.patch
>
>
> We currently have no parser for feeds in Tika, and since we are progressively getting rid of our legacy parsers in Nutch, I thought it would make sense to have one.
> The attached patch is based on the ROME feed parser (https://rome.dev.java.net/), which is under the Apache License. ROME provides a unified API for different feed formats and seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve as a basis for further improvements. It currently stores the feed's title and description in the metadata and uses the following XHTML representation for the entries:
> <A href="ENTRY_URL">ENTRY_TITLE</A>
> <P>
> ENTRY_DESCRIPTION
> </P>
> This is pretty basic but should at least allow us to retrieve the outlinks in Nutch as well as some text.
> J.
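
For illustration only, the per-entry XHTML layout described in the quoted issue could be sketched roughly as below. This is not the code from TIKA-466.patch: it uses neither ROME nor Tika, and the class FeedEntrySketch, the Entry holder, and the escape/render/renderAll helpers are hypothetical names made up for this sketch. Element names are written in lowercase because XHTML element names are lowercase.

```java
import java.util.Arrays;
import java.util.List;

public class FeedEntrySketch {

    /** Hypothetical holder for one feed entry; not a ROME or Tika type. */
    public static final class Entry {
        public final String url;
        public final String title;
        public final String description;

        public Entry(String url, String title, String description) {
            this.url = url;
            this.title = title;
            this.description = description;
        }
    }

    /** Escapes the characters that are special in XHTML text and attribute values. */
    public static String escape(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Renders one entry as a link followed by a paragraph. */
    public static String render(Entry e) {
        return "<a href=\"" + escape(e.url) + "\">" + escape(e.title) + "</a>\n"
                + "<p>" + escape(e.description) + "</p>\n";
    }

    /** Renders all entries in order, one link/paragraph pair per entry. */
    public static String renderAll(List<Entry> entries) {
        StringBuilder out = new StringBuilder();
        for (Entry e : entries) {
            out.append(render(e));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample entries, purely for demonstration.
        List<Entry> entries = Arrays.asList(
                new Entry("http://example.com/1", "First entry", "Some text"),
                new Entry("http://example.com/2?a=1&b=2", "Second <entry>", "More text"));
        System.out.print(renderAll(entries));
    }
}
```

A real parser would presumably emit these elements through a SAX content handler rather than by string concatenation; the sketch only shows the shape of the output from which a consumer such as Nutch could pick up the outlinks and text.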