Posted to user@tika.apache.org by Cihad Guzel <cg...@gmail.com> on 2022/06/03 11:30:53 UTC

Custom filter

Hi,

I want to pass the words of the extracted content through some filters
while parsing with Tika. How can I add custom filtering?

Can a content handler be used for this? Is there any documentation about it?

Regards,
Cihad Guzel

Re: Custom filter

Posted by Cihad Guzel <cg...@gmail.com>.
Hi Tim,

The document you shared is good, but there are some points that I don't
understand. You suggested that I use TeeContentHandler, and it does fit my
need. However, I still don't know how to use it.

In the wiki document you wrote the following:
"Programmatically, users have control to use any of the ContentHandlers in
tika-core or they can write their own ContentHandlers."
- So I understand that I will write a new custom handler. Let's call it
"MyCustomHandler". (It would be similar to PhoneExtractingContentHandler.)

"If doing this, make sure to consider the ContentHandlerDecorator which
allows overriding only the methods you need;"
- I did not understand this sentence. Should I also write a new decorator
class for each content handler? (MyCustomHandlerDecorator)

"also consider using the TeeContentHandler, which allows multiple handlers
to be run during the parse."
- TeeContentHandler is the best fit for my need. With TeeContentHandler I can
use both PhoneExtractingContentHandler and MyCustomHandler, as in the example
in your document.

 "An example of using the TeeContentHandler to add a language detection
handler to the regular ToXMLContentHandler:"
- Your example is good, but I don't know where to put this code. Does it go
in the MyCustomHandlerDecorator class? (Or should that class be called
MyCustomTeeContentHandlerDecorator? I'm not sure.)

 "<autoDetectParserConfig>
    <contentHandlerDecoratorFactory
class="org.apache.tika.sax.UpcasingContentHandlerDecoratorFactory"/>
  </autoDetectParserConfig>"
- Should I also write a ContentHandlerDecoratorFactory class in addition to
the ContentHandlerDecorator?

Regards,
Cihad Guzel
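
The two wiki passages quoted above describe two patterns: a decorator wraps
another handler and forwards every SAX event to it, so a subclass overrides
only the events it cares about, while a tee fans the same events out to
several handlers. The following is a minimal sketch of both ideas using only
JDK classes as stand-ins. XMLFilterImpl plays the role of Tika's
ContentHandlerDecorator, and SimpleTee is a hypothetical, much reduced
stand-in for TeeContentHandler. The names UpcasingFilter, SimpleTee,
TextCollector and DecoratorTeeDemo are illustrative assumptions, not Tika
classes.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Decorator idea: forward everything, override only characters(). */
class UpcasingFilter extends XMLFilterImpl {
    UpcasingFilter(ContentHandler downstream) {
        // XMLFilterImpl forwards each SAX event to this handler by default.
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Upper-casing can change the length, so pass the new array's length.
        char[] upper = new String(ch, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

/** Tee idea, reduced to text events only (hypothetical stand-in). */
class SimpleTee extends DefaultHandler {
    private final ContentHandler[] targets;

    SimpleTee(ContentHandler... targets) {
        this.targets = targets;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Every target sees the same text event.
        for (ContentHandler target : targets) {
            target.characters(ch, start, length);
        }
    }
}

/** Collects the text it receives so the effect can be inspected. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class DecoratorTeeDemo {
    public static void main(String[] args) throws SAXException {
        TextCollector plain = new TextCollector();
        TextCollector shouted = new TextCollector();
        // One branch receives the text untouched, the other through the decorator.
        ContentHandler tee = new SimpleTee(plain, new UpcasingFilter(shouted));
        char[] chunk = "hello tika".toCharArray();
        tee.characters(chunk, 0, chunk.length);
        System.out.println(plain.text + " / " + shouted.text);
    }
}
```

With tika-core on the classpath, the same shape would presumably be written
by extending org.apache.tika.sax.ContentHandlerDecorator and by passing the
handlers to org.apache.tika.sax.TeeContentHandler. That mapping is an
assumption of this sketch and is not confirmed anywhere in this thread.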


Re: Custom filter

Posted by Tim Allison <ta...@apache.org>.
Done. Let me know if you have any questions.

Re: Custom filter

Posted by Cihad Guzel <cg...@gmail.com>.
Hi Tim,

This document looks pretty good. Maybe an example can be added for
TeeContentHandler as well.

Regards,
Cihad Guzel


Re: Custom filter

Posted by Tim Allison <ta...@apache.org>.
First draft of that page is up.  Let me know if you have any questions.

Re: Custom filter

Posted by Tim Allison <ta...@apache.org>.
I just added the ability to wrap a content handler via tika-config.xml and
it will be out in 2.4.1 shortly.  Let me document it on our wiki.  I've
started a stub here:
https://cwiki.apache.org/confluence/display/TIKA/ModifyingContentWithHandlersAndMetadataFilters

Re: Custom filter

Posted by Cihad Guzel <cg...@gmail.com>.
Hi Nick,

Thanks for the information.

If I use embedded Tika, I think I can set the custom content handler
through the API.

On the other hand, if I use Tika server, how can I set the custom content
handler there? Is there a way to do it from the config file?

Regards,
Cihad Guzel


Re: Custom filter

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 3 Jun 2022, Cihad Guzel wrote:
> I want to pass the content's words through some filters while parsing in 
> Tika. How can I add custom filtering?
>
> Does the content handler work for this? Is there a document about this?

A custom content handler is a pretty good way to do that. Tika just uses
regular Java XML content handlers, so you don't need a Tika-specific
tutorial on writing one.

Depending on what you want to do, you can use Tika's TeeContentHandler to
send the events to both your custom handler and a normal one.
ContentHandlerDecorator can also be used to override just some bits.

Nick
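
Since the handlers are regular Java SAX content handlers, a word filter can
be sketched with JDK classes alone. The example below is a minimal sketch,
not something taken from Tika: WordFilterHandler and WordFilterDemo are
hypothetical names, and the stop-word rule is only an assumption about what
"filtering words" might mean. Text is buffered and split into words at the
end, because a SAX parser may deliver a single word across several
characters() calls.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text and drops words from a block list. */
class WordFilterHandler extends DefaultHandler {
    private final Set<String> blocked;
    private final StringBuilder text = new StringBuilder();

    WordFilterHandler(Set<String> blocked) {
        this.blocked = blocked;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Buffer raw text; a word may be split across calls.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Keep words from adjacent elements from running together.
        text.append(' ');
    }

    /** Returns the buffered text with blocked words removed. */
    String filteredText() {
        StringBuilder out = new StringBuilder();
        for (String word : text.toString().trim().split("\\s+")) {
            if (word.isEmpty() || blocked.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }
}

class WordFilterDemo {
    public static void main(String[] args) throws Exception {
        // A plain SAX parser stands in for a Tika parser in this sketch.
        String xhtml = "<body><p>call me now</p><p>or email me</p></body>";
        WordFilterHandler handler = new WordFilterHandler(Set.of("me"));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler.filteredText());
    }
}
```

In embedded use, a handler like this would be passed to the Tika parser in
place of the plain SAX parser used here. How to wire it in is left out, as
the thread itself does not show that code.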