Posted to user@tika.apache.org by Cihad Guzel <cg...@gmail.com> on 2022/10/20 23:32:15 UTC

Re: Custom filter

Hi Tim,

The document you shared is good, but there are some points that I don't
understand. You suggested that I use TeeContentHandler, which does meet
my need. However, I still don't know how to use it.

In the wiki document you wrote:
"Programmatically, users have control to use any of the ContentHandlers in
tika-core or they can write their own ContentHandlers."
- So I understand that I need to write a new custom handler. Let's call it
"MyCustomHandler". (It is similar to PhoneExtractingContentHandler.)

"If doing this, make sure to consider the ContentHandlerDecorator which
allows overriding only the methods you need;"
- I did not understand this sentence. Should I also write a new decorator
class for each content handler? (MyCustomHandlerDecorator)
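
For readers following the thread, the decorator idea can be illustrated without Tika on the classpath. The sketch below is an assumption-laden stand-in, not Tika's own code: it uses the JDK's org.xml.sax.helpers.XMLFilterImpl in place of Tika's ContentHandlerDecorator, and the names DecoratorSketch, TextCollector and UpcasingFilter are hypothetical. It only shows the pattern of wrapping one handler and overriding a single method, so a separate decorator per content handler should not be needed.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class DecoratorSketch {

    // Stand-in for the handler that receives the final text.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Hypothetical filter: every SAX event is forwarded unchanged by the
    // base class; only characters() is overridden to upper-case the text.
    static class UpcasingFilter extends XMLFilterImpl {
        UpcasingFilter(ContentHandler wrapped) {
            setContentHandler(wrapped);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            char[] upper = new String(ch, start, length)
                    .toUpperCase(Locale.ROOT)
                    .toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    // Feeds the input through the filter by hand, as a parser would.
    static String upcase(String input) {
        TextCollector collector = new TextCollector();
        ContentHandler handler = new UpcasingFilter(collector);
        try {
            handler.startDocument();
            handler.characters(input.toCharArray(), 0, input.length());
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(upcase("call 555-0100"));
    }
}
```

With Tika available, the same shape would presumably be a subclass of org.apache.tika.sax.ContentHandlerDecorator that overrides only characters(); that mapping is an assumption based on the wiki sentence quoted above.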

"also consider using the TeeContentHandler, which allows multiple handlers
to be run during the parse."
- TeeContentHandler is the best fit for my need. With it I can use both
PhoneExtractingContentHandler and MyCustomHandler, as in the example in
your document.
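
As a rough illustration of what a tee does, here is a JDK-only sketch. It is not Tika's TeeContentHandler: SimpleTee is a hypothetical, simplified stand-in that forwards only three SAX events, and MyCustomHandler here is an invented word counter used purely as an example of a second handler.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TeeSketch {

    // Stand-in for a "normal" handler: it just accumulates the text.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Hypothetical custom handler: counts whitespace-separated words.
    // Text is buffered and counted at endDocument() because a parser may
    // split one word across several characters() calls.
    static class MyCustomHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        int wordCount;

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);
        }

        @Override
        public void endDocument() {
            String trimmed = buffer.toString().trim();
            wordCount = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    // Simplified tee: fans each event out to every target handler.
    // Only the three events used in this sketch are forwarded.
    static class SimpleTee extends DefaultHandler {
        private final ContentHandler[] targets;

        SimpleTee(ContentHandler... targets) {
            this.targets = targets;
        }

        @Override
        public void startDocument() throws SAXException {
            for (ContentHandler h : targets) h.startDocument();
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : targets) h.characters(ch, start, length);
        }

        @Override
        public void endDocument() throws SAXException {
            for (ContentHandler h : targets) h.endDocument();
        }
    }

    // Sends the input through the tee in two chunks and reports both results.
    static String run(String input) {
        TextCollector collector = new TextCollector();
        MyCustomHandler custom = new MyCustomHandler();
        ContentHandler tee = new SimpleTee(collector, custom);
        char[] chars = input.toCharArray();
        int mid = chars.length / 2;
        try {
            tee.startDocument();
            tee.characters(chars, 0, mid);
            tee.characters(chars, mid, chars.length - mid);
            tee.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text + "|" + custom.wordCount;
    }

    public static void main(String[] args) {
        System.out.println(run("one two three"));
    }
}
```

With Tika on the classpath, the hand-written SimpleTee would presumably be replaced by org.apache.tika.sax.TeeContentHandler constructed with the two handlers and passed to the parser's parse call; the handlers themselves would stay ordinary SAX content handlers.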

 "An example of using the TeeContentHandler to add a language detection
handler to the regular ToXMLContentHandler:"
- Your example is good, but I don't know where this code should go. Does it
belong in the MyCustomHandlerDecorator class? (Or perhaps it should be
called MyCustomTeeContentHandlerDecorator; I'm not sure.)

 "<autoDetectParserConfig>
    <contentHandlerDecoratorFactory
class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>"
- Should I write a ContentHandlerDecoratorFactory class in addition to a
ContentHandlerDecorator?

Regards,
Cihad Guzel


On Fri, Jun 3, 2022 at 11:50 PM Tim Allison <ta...@apache.org> wrote:

> Done. Let me know if you have any questions.
>
> On Fri, Jun 3, 2022 at 3:59 PM Cihad Guzel <cg...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> This document looks pretty good. Maybe an example can be added for
>> TeeContentHandler as well.
>>
>> Regards,
>> Cihad Guzel
>>
>>
>> On Fri, Jun 3, 2022 at 10:24 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> First draft of that page is up.  Let me know if you have any questions.
>>>
>>> On Fri, Jun 3, 2022 at 2:03 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> I just added the ability to wrap a content handler via tika-config.xml
>>>> and it will be out in 2.4.1 shortly.  Let me document it on our wiki.  I've
>>>> started a stub here:
>>>> https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters
>>>>
>>>> On Fri, Jun 3, 2022 at 1:41 PM Cihad Guzel <cg...@gmail.com> wrote:
>>>>
>>>>> Hi Nick,
>>>>>
>>>>> Thanks for the information.
>>>>>
>>>>> If I use embedded Tika, I think I can set the custom content
>>>>> handler through the API.
>>>>>
>>>>> On the other hand, if I use Tika server, how can I set the custom
>>>>> content handler on the server? Is there a way to do it from the
>>>>> config file?
>>>>>
>>>>> Regards,
>>>>> Cihad Guzel
>>>>>
>>>>>
>>>>> On Fri, Jun 3, 2022 at 7:09 PM Nick Burch <ap...@gagravarr.org> wrote:
>>>>>
>>>>>> On Fri, 3 Jun 2022, Cihad Guzel wrote:
>>>>>> > I want to pass the content's words through some filters while parsing in
>>>>>> > Tika. How can I add custom filtering?
>>>>>> >
>>>>>> > Does the content handler work for this? Is there a document about this?
>>>>>>
>>>>>> A custom content handler is a pretty good way to do that. Tika just uses
>>>>>> regular Java XML content handlers, so you don't need a Tika-specific
>>>>>> tutorial on writing one.
>>>>>>
>>>>>> Depending on what you're wanting to do, you can use Tika's
>>>>>> TeeContentHandler to send the events to both your custom handler and a
>>>>>> normal one. ContentHandlerDecorator can also be used to override just some
>>>>>> bits.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>>