Posted to dev@tika.apache.org by "Thomas (Jira)" <ji...@apache.org> on 2019/08/30 08:49:00 UTC

[jira] [Created] (TIKA-2932) Filter Documents Meta Data

Thomas created TIKA-2932:
----------------------------

             Summary: Filter Documents Meta Data
                 Key: TIKA-2932
                 URL: https://issues.apache.org/jira/browse/TIKA-2932
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 1.22
            Reporter: Thomas


Hello!

Is there a way to filter out tags such as *[image: ]* and [bookmark] from the text I get when parsing documents? I need this because the metadata sometimes does not return the number of words for a document that contains images or tables.

*MetaData*

{"title":"Complete name,","description":null,"keywords":[],"language":"en","encoding":null,"author":"","generator":"Microsoft Office Word","pages":0,"words":0 ...

*Text*

[image: ] Certified Translation Certificate of Accuracy Your name here Translator/Interpreter Translated document: [bookmark: _GoBack]As a translator for Your Spanish Translation, Inc., I, Your name here, declare that I am a bilingual translator who is thoroughly familiar with the English and source language languages. I have translated the attached document to the best of my knowledge from source language into English and the English text is an accurate and true translation of the original document presented to the best of my knowledge and belief. Signed on June 1, 201 Sign here in blue ink Your name here Professional Translator for Day Translations, Inc. [bookmark: _MailAutoSig]
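A minimal sketch of one possible workaround: post-process the extracted text with a regular expression, then count the words directly instead of relying on the "words" metadata field. It assumes the markers always have the form "[image: ...]" or "[bookmark: ...]" as in the sample above. MarkerFilter, stripMarkers and countWords are hypothetical names for illustration and are not part of Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Tika API.
final class MarkerFilter {

    // Assumption: markers look like "[image: ...]" or "[bookmark: ...]",
    // as in the sample text, and never contain a closing bracket.
    private static final Pattern MARKER =
            Pattern.compile("\\[(?:image|bookmark):[^\\]]*\\]");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each marker with a space, then collapses runs of whitespace.
    static String stripMarkers(String text) {
        String withoutMarkers = MARKER.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(withoutMarkers).replaceAll(" ").trim();
    }

    // Counts whitespace-separated tokens in the cleaned text.
    static int countWords(String text) {
        String cleaned = stripMarkers(text);
        return cleaned.isEmpty() ? 0 : WHITESPACE.split(cleaned).length;
    }

    public static void main(String[] args) {
        String sample =
                "[image: ] Certified Translation [bookmark: _GoBack]As a translator";
        System.out.println(stripMarkers(sample));
        System.out.println(countWords(sample));
    }
}
```

A regex pass like this also removes any genuine document text that happens to match the same bracketed pattern, so it is only a stopgap compared with suppressing the markers at extraction time.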

Please help!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)