Posted to dev@tika.apache.org by "Rida Benjelloun (JIRA)" <ji...@apache.org> on 2008/01/14 18:38:34 UTC
[jira] Created: (TIKA-113) Metadata (such as title) should not be part of content
Metadata (such as title) should not be part of content
------------------------------------------------------
Key: TIKA-113
URL: https://issues.apache.org/jira/browse/TIKA-113
Project: Tika
Issue Type: Wish
Components: parser
Affects Versions: 0.2-incubating
Reporter: Rida Benjelloun
Metadata (such as title) is added to the content. In my opinion, it would be preferable for toString() on the writer to return only the content of the document and not the metadata. The metadata is already stored in the metadata object.
Rida.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-113) Metadata (such as title) should not be part of content
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-113:
-------------------------------
Affects Version/s: (was: 0.2-incubating)
Fix Version/s: 0.2-incubating
Issue Type: Improvement (was: Wish)
I think the SAX event stream should still contain selected metadata in the <head/> section. For example, the current XHTMLContentHandler outputs the TITLE metadata field (if available) as the <title/> of the generated XML document.
Instead of changing that pattern, we should probably either change WriteOutContentHandler to output only the content of the <body/> element or add a new ContentHandler utility class with that feature.
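To illustrate the second option, here is a minimal sketch of a standalone handler that collects only the text found inside <body/>. It relies on the JDK SAX API alone; the class name BodyTextHandler, the bodyText helper, and the sample document are hypothetical and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not a Tika class.
public class BodyTextHandler extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting below it.
        if (bodyDepth > 0 || (XHTML.equals(uri) && "body".equals(local))) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head> (for example the title) is skipped.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses an XHTML string and returns only the body text.
    public static String bodyText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to see the XHTML namespace URI
            BodyTextHandler handler = new BodyTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>My Title</title></head>"
                + "<body><p>Hello body</p></body></html>";
        System.out.println(bodyText(doc)); // prints "Hello body"
    }
}
```

With a handler along these lines, the <head/> section stays in the SAX event stream for consumers that want it, while toString() returns document content only.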
[jira] Commented: (TIKA-113) Metadata (such as title) should not be part of content
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569683#action_12569683 ]
Jukka Zitting commented on TIKA-113:
------------------------------------
A solution based on the current code is:
    // Forward only the events matched by the XPath expression (elements below the XHTML <body>) to the writer.
    Writer writer = ...;
    XPathParser xpath = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
    ContentHandler handler = new MatchingContentHandler(
            new WriteOutContentHandler(writer),
            xpath.parse("/xhtml:html/xhtml:body//*"));
I'm not sure if we should codify that into a helper class or a method.
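If the pattern were codified as a helper method, one possible shape is a static factory that wraps any delegate handler so that it only sees events below <body/>. The sketch below is an assumption-laden illustration: it uses plain JDK SAX instead of Tika's XPathParser and MatchingContentHandler, and the names BodyOnly, bodyOnly, and trace are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not a Tika class.
public class BodyOnly {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Wraps a delegate so that it receives only element and text events below <body>.
    public static DefaultHandler bodyOnly(final ContentHandler delegate) {
        return new DefaultHandler() {
            private int depth = 0; // 0 outside <body>, 1 at <body>, more than 1 below it

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException {
                if (depth > 0) {
                    depth++;
                    delegate.startElement(uri, local, qName, atts);
                } else if (XHTML.equals(uri) && "body".equals(local)) {
                    depth = 1; // <body> itself is not forwarded
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (depth > 1) {
                    delegate.endElement(uri, local, qName);
                }
                if (depth > 0) {
                    depth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                if (depth > 0) {
                    delegate.characters(ch, start, length);
                }
            }
        };
    }

    // Demo: records the events that reach the delegate as a simple string.
    public static String trace(String xhtml) {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xhtml)), bodyOnly(recorder));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

The decorator form keeps the filtering separate from whatever the delegate does with the events, which is the same division of labour as the MatchingContentHandler snippet above.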
[jira] Commented: (TIKA-113) Metadata (such as title) should not be part of content
Posted by "Rida Benjelloun (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560908#action_12560908 ]
Rida Benjelloun commented on TIKA-113:
--------------------------------------
+1, I agree with Jukka's suggestion.
Rida.
[jira] Resolved: (TIKA-113) Metadata (such as title) should not be part of content
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-113.
--------------------------------
Resolution: Fixed
Assignee: Jukka Zitting
Resolved in revision 646748 by implementing a BodyContentHandler class for getting just the XHTML body content.