Posted to dev@clerezza.apache.org by "Davide Palmisano (JIRA)" <ji...@apache.org> on 2010/09/26 21:21:33 UTC

[jira] Commented: (CLEREZZA-182) Integrate Apache Tika inside Apache Clerezza

    [ https://issues.apache.org/jira/browse/CLEREZZA-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915021#action_12915021 ] 

Davide Palmisano commented on CLEREZZA-182:
-------------------------------------------

Dear Tommaso,

In the attached patch [1] (taken from /trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.metadata-generator) you can find an attempt to integrate Apache Tika 0.7 by implementing the MediaTypeTextExtractor interface. My changes include:

1) a Tika dependency added to the pom.xml
2) two tests (one for my implementation, TikaTextExtractor, and one for your PlainTextExtractor class)
3) some additional Javadoc on the MediaTypeTextExtractor interface
4) a couple of new constructors for the UnsupportedMediaTypeException exception
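
To illustrate the general shape of these pieces, here is a minimal, dependency-free sketch. The method names and signatures of MediaTypeTextExtractor, the constructors of UnsupportedMediaTypeException, and the body of PlainTextExtractor are all assumptions made for illustration; they are not copied from the patch. A Tika-backed extractor would presumably implement the same interface and delegate the parsing to Tika, which is left out here so the sketch runs without extra dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Assumed exception shape: the patch mentions "a couple of new constructors",
// but their exact signatures are a guess.
class UnsupportedMediaTypeException extends Exception {
    UnsupportedMediaTypeException(String message) {
        super(message);
    }

    UnsupportedMediaTypeException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Assumed interface shape (hypothetical method names and parameters).
interface MediaTypeTextExtractor {
    boolean supports(String mediaType);

    String extract(InputStream in, String mediaType)
            throws UnsupportedMediaTypeException, IOException;
}

// Illustrative plain-text extractor: reads the whole stream as UTF-8 text.
class PlainTextExtractor implements MediaTypeTextExtractor {
    @Override
    public boolean supports(String mediaType) {
        return mediaType != null && mediaType.startsWith("text/plain");
    }

    @Override
    public String extract(InputStream in, String mediaType)
            throws UnsupportedMediaTypeException, IOException {
        if (!supports(mediaType)) {
            throw new UnsupportedMediaTypeException("Unsupported media type: " + mediaType);
        }
        // Copy the stream into a buffer, then decode it as UTF-8.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}

class ExtractorSketch {
    public static void main(String[] args) throws Exception {
        MediaTypeTextExtractor extractor = new PlainTextExtractor();
        InputStream in = new ByteArrayInputStream(
                "hello Clerezza".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractor.extract(in, "text/plain"));
    }
}
```

Passing the media type into extract lets an implementation reject unsupported content up front; whether the real interface does this is not something this message confirms.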

Let me know if it fits your needs.

Davide

[1] CLEREZZA-182.patch

> Integrate Apache Tika inside Apache Clerezza
> --------------------------------------------
>
>                 Key: CLEREZZA-182
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-182
>             Project: Clerezza
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>         Attachments: CLEREZZA-182.patch
>
>
> Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It would be nice to have it integrated inside Apache Clerezza so that Resources could be easily enriched and auto-tagged with metadata once inside Clerezza.
