Posted to user@tika.apache.org by Eric Pugh <ep...@opensourceconnections.com> on 2019/08/09 20:52:39 UTC

Surfacing hOCR output from Tika Server

I’m working with the Tika Server directly rather than with the Tika code, and I have it set up so that when I post a PDF to the server I get back the XML instead of the text version, by specifying outputType=hocr in the TesseractOCRConfig.properties file.

However, I’m looking to get back all the hOCR metadata as well, i.e. the bounding boxes around each word.  Is returning that up the chain possible?



Eric

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.

Re: Surfacing hOCR output from Tika Server

Posted by Tim Allison <ta...@apache.org>.

Self serve!  Perfect!  Y, that's what I was going to recommend.

If you don't want the metadata (/rmeta), try the basic /tika handler
(if you haven't already!).

On Mon, Aug 12, 2019 at 8:53 AM Eric Pugh
<ep...@opensourceconnections.com> wrote:
>
> I wanted to share the magic set of parameters that worked for me:
>
> curl -T mypdf.pdf http://localhost:9998/rmeta --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_only" --header "X-Tika-OCRoutputType: hocr"
>
> This returns the output as JSON; under the key X-TIKA:content, in an awful escaped XML format, is the hOCR output:
>
> \u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e
>
> I’m going to play around some more and see if maybe I can get a nicer structure to be returned!
>
> Eric
>

Re: Surfacing hOCR output from Tika Server

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I wanted to share the magic set of parameters that worked for me:

curl -T mypdf.pdf http://localhost:9998/rmeta --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_only" --header "X-Tika-OCRoutputType: hocr"

This returns the output as JSON; under the key X-TIKA:content, in an awful escaped XML format, is the hOCR output:

\u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e
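The escaping in that sample is ordinary JSON string escaping (\u003c is "<", \u003d is "="), so any JSON parser should undo it. As a rough sketch in Python, and not anything Tika itself provides, the word boxes could be pulled out like this. It assumes the response is a JSON array of metadata objects, and that each word span has its attributes in the same order as the sample with an integer x_wconf; the names sample_response, WORD_RE and extract_words are made up for illustration.

```python
import json
import re

# Hypothetical sample shaped like the response described above: assumed to be
# a JSON array of metadata objects with the hOCR markup under "X-TIKA:content".
sample_response = r'''[{"X-TIKA:content": "\u003cspan class\u003d\"ocrx_word\" id\u003d\"word_1_11\" title\u003d\"bbox 400 453 518 475; x_wconf 96\"\u003ePerspectives\u003c/span\u003e"}]'''

# Matches one hOCR word span. Assumes the attribute order seen in the sample
# (class, id, title) and an integer confidence value.
WORD_RE = re.compile(
    r'<span class="ocrx_word" id="(?P<id>[^"]+)" '
    r'title="bbox (?P<x0>\d+) (?P<y0>\d+) (?P<x1>\d+) (?P<y1>\d+); '
    r'x_wconf (?P<conf>\d+)">(?P<text>[^<]*)</span>'
)


def extract_words(rmeta_json):
    """Return one dict per OCR'd word: id, text, bbox and confidence."""
    words = []
    # json.loads turns \u003c back into "<", so the content is plain markup.
    for doc in json.loads(rmeta_json):
        content = doc.get("X-TIKA:content") or ""
        for m in WORD_RE.finditer(content):
            words.append({
                "id": m["id"],
                "text": m["text"],
                # bbox is (left, top, right, bottom) in page pixels.
                "bbox": tuple(int(m[k]) for k in ("x0", "y0", "x1", "y1")),
                "confidence": int(m["conf"]),
            })
    return words


words = extract_words(sample_response)
print(words)
```

For real documents a proper HTML/XML parser would be more robust than a regular expression, since attribute order and confidence formatting may vary between Tesseract versions.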

I’m going to play around some more and see if maybe I can get a nicer structure to be returned!

Eric

