Posted to dev@tika.apache.org by Nicholas DiPiazza <ni...@gmail.com> on 2021/06/23 18:27:30 UTC

Customizing HTML parser when using Tika-server

When we use Tika Server to parse an HTML document such as

<html><title>hi there</title><body>woah</body></html>

and call the parser through the endpoint

http://localhost:49309/rmeta/text

we get a basic result like this:

[
{
"Content-Encoding": "ISO-8859-1",
"Content-Type": "text/html; charset=ISO-8859-1",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"
],
"X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
"X-TIKA:content_handler": "ToTextContentHandler",
"X-TIKA:embedded_depth": "0",
"X-TIKA:parse_time_millis": "284",
"dc:title": "hi there",
"title": "hi there"
}
]

Notice that the title text ("hi there") also appears in the body content
(X-TIKA:content).

When using Tika embedded in a Java app, I know that if you extend Tika's
default handler you can customize how XHTML elements such as <title> are
handled, so that, for example, the content field does not include the
title.

Does anyone know whether something similar is possible when using Tika
Server?
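
For reference, here is a minimal sketch of the embedded approach described above. It uses only JDK SAX classes so that it runs without Tika on the classpath. The class names TitleSkippingHandler and TitleSkippingDemo are hypothetical, and the JDK SAX parser stands in for Tika here. In an embedded Tika app, a handler like this would presumably be passed to the parser's parse(...) call in place of the default one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects character data but
// skips anything nested inside a <title> element.
class TitleSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int titleDepth = 0;

    // Match on either name, since localName is empty when the SAX source
    // is not namespace-aware.
    private static boolean isTitle(String localName, String qName) {
        return "title".equalsIgnoreCase(localName) || "title".equalsIgnoreCase(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isTitle(localName, qName)) {
            titleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTitle(localName, qName)) {
            titleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is outside of <title>.
        if (titleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class TitleSkippingDemo {
    // Stand-in for Tika: feeds a well-formed XHTML string through the JDK
    // SAX parser and returns the text the handler collected.
    static String extract(String xhtml) throws Exception {
        TitleSkippingHandler handler = new TitleSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints "woah": the title text is dropped from the content.
        System.out.println(extract("<html><title>hi there</title><body>woah</body></html>"));
    }
}
```

The depth counter rather than a boolean flag is only a defensive choice in case title elements are ever nested in the event stream.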

Re: Customizing HTML parser when using Tika-server

Posted by Nicholas DiPiazza <ni...@gmail.com>.
Actually,

http://localhost:49309/rmeta/body

works.

Interesting! I need to read up on the difference between these two endpoints
and see whether I can just switch to always using this one.
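
To compare the two endpoints side by side, something like the following sketch could be used. It assumes Java 11+ (java.net.http), a Tika Server listening on localhost:49309 as in this thread, and that /rmeta accepts the document as a PUT body. The class name RmetaRequests and its build helper are hypothetical. As far as I understand, the last path segment selects the content handler type, which would explain why /rmeta/body leaves the title out of X-TIKA:content.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class RmetaRequests {
    // Builds (but does not send) a PUT request to <baseUrl>/rmeta/<handlerType>.
    // handlerType is "text" or "body" in this thread.
    static HttpRequest build(String baseUrl, String handlerType, String html) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta/" + handlerType))
                .header("Accept", "application/json")
                .header("Content-Type", "text/html")
                .PUT(HttpRequest.BodyPublishers.ofString(html))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><title>hi there</title><body>woah</body></html>";
        HttpClient client = HttpClient.newHttpClient();
        // Requires a running Tika Server; the port is the one from this thread.
        for (String type : new String[] {"text", "body"}) {
            HttpResponse<String> response = client.send(
                    build("http://localhost:49309", type, html),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(type + " -> " + response.body());
        }
    }
}
```

Comparing the X-TIKA:content values in the two JSON responses should show whether the title is still present.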

On Wed, Jun 23, 2021 at 1:48 PM Tim Allison <ta...@apache.org> wrote:

> > is possible in tika-server
>
> Currently, but this has been on my wishlist forever…
>
> On Wed, Jun 23, 2021 at 2:35 PM Tim Allison <ta...@apache.org> wrote:
>
> > I don’t think handler customization generally is possible in Tika-server.
> >
> > What happens w /rmeta/body?

Re: Customizing HTML parser when using Tika-server

Posted by Tim Allison <ta...@apache.org>.
> is possible in tika-server

Currently, but this has been on my wishlist forever…

On Wed, Jun 23, 2021 at 2:35 PM Tim Allison <ta...@apache.org> wrote:

> I don’t think handler customization generally is possible in Tika-server.
>
> What happens w /rmeta/body?

Re: Customizing HTML parser when using Tika-server

Posted by Tim Allison <ta...@apache.org>.
I don’t think handler customization generally is possible in Tika-server.

What happens with /rmeta/body?
