Posted to user@manifoldcf.apache.org by Jörn Franke <jo...@gmail.com> on 2020/01/02 21:30:00 UTC

Tika Extractor - extract document as (X)HTML, not as text only

Hi,

Is it possible to have the Tika Extractor (the ManifoldCF version, not the
extract handler) produce (X)HTML output instead of plain text?
How to achieve this in Tika itself is fairly clear:
https://tika.apache.org/1.8/examples.html#Picking_different_output_formats

Reason: We need to extract very specific chapters from a Word document and
index them as dedicated Solr documents (the latter part will probably still
have to be done in an update chain). There, we currently already extract the
(sub-)chapters we need from the HTML version of the Word document that Tika
creates.
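
To make the chapter-extraction step concrete, here is a minimal sketch of
splitting an XHTML document into chapters. It assumes the XHTML has no
DOCTYPE and that every chapter starts with an h1 element placed directly
under body; the real structure Tika produces for a given Word document may
differ. The class and method names (ChapterSplitter, splitByH1) are
hypothetical, and only the JDK's built-in XML parser is used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Tika, Solr, or ManifoldCF.
final class ChapterSplitter {

    /**
     * Maps each h1 heading to the text of the elements that follow it,
     * up to the next h1. Assumes h1 elements are direct children of body.
     */
    static Map<String, String> splitByH1(String xhtml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so no external DTD is fetched.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }

        Map<String, String> chapters = new LinkedHashMap<>();
        Node body = doc.getElementsByTagName("body").item(0);
        if (body == null) {
            return chapters;
        }

        String currentHeading = null;
        StringBuilder text = new StringBuilder();
        NodeList children = body.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments between elements
            }
            if ("h1".equals(child.getNodeName())) {
                // A new chapter starts: store the previous one, if any.
                if (currentHeading != null) {
                    chapters.put(currentHeading, text.toString().trim());
                }
                currentHeading = child.getTextContent().trim();
                text.setLength(0);
            } else if (currentHeading != null) {
                text.append(child.getTextContent().trim()).append('\n');
            }
        }
        if (currentHeading != null) {
            chapters.put(currentHeading, text.toString().trim());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Intro</h1><p>First.</p><h1>Scope</h1><p>Second.</p>"
                + "</body></html>";
        splitByH1(xhtml).forEach((heading, content) ->
                System.out.println(heading + " -> " + content));
    }
}
```

Each map entry could then become one dedicated Solr document. Plain-text
output loses the heading elements this approach relies on, which is why the
XHTML form matters here.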

Thank you.

Best regards

Re: Tika Extractor - extract document as (X)HTML, not as text only

Posted by Jörn Franke <jo...@gmail.com>.

Thanks a lot, Karl. I will look into this.


> On 03.01.2020 at 10:04, Karl Wright <da...@gmail.com> wrote:
> 
> 
> Plain text is used because otherwise standard text processing inside Lucene will index tags as terms, which is definitely not what you usually want.
> 
> If you want the Tika Extractor to be able to optionally generate an XHTML format, that sounds like an additional operating mode for the Tika Extractor.  To do that you'd need to add a flag, probably to the Output Specification, with associated UI components, and be sure to maintain backwards compatibility.
> 
> Karl

Re: Tika Extractor - extract document as (X)HTML, not as text only

Posted by Karl Wright <da...@gmail.com>.

Plain text is used because otherwise standard text processing inside Lucene
will index tags as terms, which is definitely not what you usually want.
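
To illustrate the point about tags becoming terms, here is a rough sketch
that tokenizes a marked-up string and a plain-text string in the same way.
The tokenizer is a deliberately simplified stand-in (split on anything that
is not a letter or digit, then lowercase), not Lucene's actual analysis
chain, and the class name MarkupTokenDemo is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for a text analyzer; this is NOT Lucene code.
final class MarkupTokenDemo {

    /** Splits on runs of non-letter, non-digit characters and lowercases each token. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Markup: element names end up in the token stream.
        System.out.println(tokenize("<h1>Scope</h1><p>Applies to all sites.</p>"));
        // prints [h1, scope, h1, p, applies, to, all, sites, p]

        // Plain text: only the words themselves are indexed.
        System.out.println(tokenize("Scope Applies to all sites."));
        // prints [scope, applies, to, all, sites]
    }
}
```

With markup in the field, a search for "p" or "h1" would match nearly every
document, which is the kind of noise plain-text extraction avoids.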

If you want the Tika Extractor to be able to optionally generate an XHTML
format, that sounds like an additional operating mode for the Tika
Extractor.  To do that you'd need to add a flag, probably to the Output
Specification, with associated UI components, and be sure to maintain
backwards compatibility.
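
As a rough illustration of the backwards-compatibility concern, the sketch
below reads an output-mode flag and falls back to plain text whenever the
flag is missing or unrecognized, so specifications saved before the flag
existed keep their current behavior. Everything here is hypothetical: the
parameter name "outputmode", the Mode enum, and the use of a plain Map in
place of ManifoldCF's real specification objects are assumptions made for
this example only.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; does not use or mirror the ManifoldCF API.
final class ExtractorOutputMode {

    enum Mode { TEXT, XHTML }

    // Assumed parameter name, not an existing ManifoldCF setting.
    static final String PARAM_OUTPUT_MODE = "outputmode";

    /** Resolves the output mode, defaulting to TEXT for old or invalid specifications. */
    static Mode fromSpec(Map<String, String> spec) {
        String raw = spec.get(PARAM_OUTPUT_MODE);
        if (raw == null || raw.trim().isEmpty()) {
            // Specifications saved before the flag existed have no value here.
            return Mode.TEXT;
        }
        try {
            return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unknown value: keep the existing plain-text behavior.
            return Mode.TEXT;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromSpec(Map.of()));                           // prints TEXT
        System.out.println(fromSpec(Map.of(PARAM_OUTPUT_MODE, "xhtml"))); // prints XHTML
    }
}
```

Defaulting to TEXT means existing jobs would not change their output after
an upgrade; only jobs that explicitly opt in would emit XHTML.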

Karl