Posted to user@tika.apache.org by "Thamme Gowda N." <tg...@gmail.com> on 2016/03/23 22:53:29 UTC

Fwd: How to enable multiple parsers for content type ?

Hi Tika experts,

Question: How can I enable multiple parsers for specific MIME types?

I am using Tika to parse HTML pages.

My requirement is that both *NamedEntityParser* and *HtmlParser* have to be
enabled for specific web-related MIME types such as *text/html* and
*application/xhtml+xml*.

From my findings on the Tika wiki, this should be possible with CompositeParser,
but I am not getting it right. Only the last parser registered for a MIME
type seems to take effect.

My configuration is given below.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>

        <parser class="org.apache.tika.parser.ner.NamedEntityParser">
            <mime>text/plain</mime>
            <mime>text/html</mime>
            <mime>text/x-php</mime>
            <mime>text/x-jsp</mime>
            <mime>application/atom+xml</mime>
            <mime>application/xhtml+xml</mime>
            <mime>application/xml</mime>
            <mime>application/rss+xml</mime>
            <mime>application/pdf</mime>
            <mime>application/msword</mime>
            <mime>text/asp</mime>
        </parser>

        <parser class="org.apache.tika.parser.html.HtmlParser">
            <mime>text/html</mime>
            <mime>text/x-php</mime>
            <mime>text/x-jsp</mime>
            <mime>application/atom+xml</mime>
            <mime>application/xhtml+xml</mime>
            <mime>application/xml</mime>
            <mime>application/rss+xml</mime>
            <mime>text/asp</mime>
        </parser>
    </parsers>
</properties>



-
Thanks in advance
Thamme.

--
*Thamme Gowda N. *
Grad Student at usc.edu
Twitter: @thammegowda  <https://twitter.com/thammegowda>
Website : http://scf.usc.edu/~tnarayan/

Re: Fwd: How to enable multiple parsers for content type ?

Posted by "Thamme Gowda N." <tg...@gmail.com>.
Thanks for clarifying, Nick.

This would be a nice feature to have.
I will have a look at the past discussions before proceeding.

-
Thamme

On Wed, Mar 23, 2016 at 3:01 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Wed, 23 Mar 2016, Thamme Gowda N. wrote:
>
>> Question: How can I enable multiple parsers for specific MIME types?
>>
>> I am using Tika to parse HTML pages.
>>
>> My requirement is that both *NamedEntityParser* and *HtmlParser* have to be
>> enabled for specific web-related MIME types such as *text/html* and
>> *application/xhtml+xml*.
>>
>
> This is not currently supported.
>
> See http://wiki.apache.org/tika/CompositeParserDiscussion for the
> discussion on it. If you have ideas on how we can solve the issue of
> multiple parsers needing to output to the same write-once SAX stream,
> including for the fallback case, please shout!
>
> (You can chain multiple content handlers together, so one option might be
> to have the named-entity logic enrich the HTML SAX event stream rather
> than run as a standalone parser.)
>
> Nick
>

Re: Fwd: How to enable multiple parsers for content type ?

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Mar 2016, Thamme Gowda N. wrote:
> Question: How can I enable multiple parsers for specific MIME types?
>
> I am using Tika to parse HTML pages.
>
> My requirement is that both *NamedEntityParser* and *HtmlParser* have to be
> enabled for specific web-related MIME types such as *text/html* and
> *application/xhtml+xml*.

This is not currently supported.

See http://wiki.apache.org/tika/CompositeParserDiscussion for the 
discussion on it. If you have ideas on how we can solve the issue of 
multiple parsers needing to output to the same write-once SAX stream, 
including for the fallback case, please shout!

(You can chain multiple content handlers together, so one option might be
to have the named-entity logic enrich the HTML SAX event stream rather
than run as a standalone parser.)

Nick
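
The content-handler chaining suggested above can be illustrated with a
minimal sketch. It is not Tika code: it uses only the JDK's SAX classes, with
the JDK XML parser standing in for HtmlParser (so the input must be
well-formed XHTML) and a naive capitalized-word regex standing in for real
named entity recognition. The names ChainedHandlersSketch, TextCollector and
EntitySpotter are hypothetical. Tika parsers write to a standard
org.xml.sax.ContentHandler, so a handler built this way could in principle be
passed to a single parser instead of registering two parsers for one type.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class ChainedHandlersSketch {

    /** Downstream handler: collects the plain text of the document. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Upstream handler: inspects text runs, then forwards every event unchanged. */
    static class EntitySpotter extends XMLFilterImpl {
        // Hypothetical stand-in for real NER: runs of capitalized words.
        private static final Pattern CAPITALIZED =
                Pattern.compile("\\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\\b");

        final List<String> entities = new ArrayList<>();
        private final StringBuilder run = new StringBuilder();

        EntitySpotter(ContentHandler downstream) {
            setContentHandler(downstream);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            run.append(ch, start, length);        // SAX may split one text node into chunks
            super.characters(ch, start, length);  // forward to the downstream handler
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            flush();
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            flush();
            super.endElement(uri, localName, qName);
        }

        /** Scans the buffered text run for candidate entities and resets the buffer. */
        private void flush() {
            Matcher m = CAPITALIZED.matcher(run);
            while (m.find()) {
                entities.add(m.group());
            }
            run.setLength(0);
        }
    }

    /** Simple holder for both outputs of the single parse pass. */
    static class Result {
        final String text;
        final List<String> entities;

        Result(String text, List<String> entities) {
            this.text = text;
            this.entities = entities;
        }
    }

    static Result parse(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        EntitySpotter spotter = new EntitySpotter(collector); // chain: spotter -> collector

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(spotter);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return new Result(collector.text.toString(), spotter.entities);
    }

    public static void main(String[] args) throws Exception {
        Result r = parse("<html><body><p>Nick Burch works on Apache Tika.</p></body></html>");
        System.out.println("text: " + r.text);
        System.out.println("entities: " + r.entities);
    }
}
```

The point of the design is that the document is parsed once and the SAX
stream is written once; the entity logic sits in the handler chain rather
than competing with the HTML parser for the same MIME type. This sketch only
collects entities alongside the stream. Actually enriching the stream (for
example, emitting extra elements around matches) would be a further step.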