
Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/10/29 16:54:55 UTC

OCR with bounding boxes

I'm pretty sure Tesseract can do this, but does Tika let you specify a bounding box when OCR'ing a page? For example, could we give it the coordinates of a single paragraph or section of a document and have it OCR only that region?


Thanks
Peter

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

RE: OCR with bounding boxes

Posted by Peter Kronenberg <pe...@torch.ai>.

It would probably be a per-file/per-request setting.



From: Tim Allison <ta...@apache.org>
Sent: Friday, October 29, 2021 1:58 PM
To: user@tika.apache.org
Subject: Re: OCR with bounding boxes


That's not currently supported, and in fact, I don't think we even support running OCR on specific pages within PDFs (and I do remember we've had that request occasionally).  Would this be a per-file configuration or would you want to specify something for all files?


Re: OCR with bounding boxes

Posted by Tim Allison <ta...@apache.org>.

That's not currently supported, and in fact, I don't think we even support
running OCR on specific pages within PDFs (and I do remember we've had that
request occasionally).  Would this be a per-file configuration or would you
want to specify something for all files?
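
A possible workaround while this is unsupported, offered as a minimal sketch and not something proposed in this thread: crop the page image to the region of interest yourself and send only the cropped image on for OCR. The names RegionCropper and cropToPng are hypothetical, the box is assumed to be in pixel coordinates of an already-rendered page image, and the sketch uses only the JDK; it does not call any Tika or Tesseract API.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tika.
class RegionCropper {

    /**
     * Crops a page image to the given bounding box and returns the
     * region encoded as PNG bytes.
     * Assumption: the box is expressed in pixels of the page image.
     */
    static byte[] cropToPng(BufferedImage page, Rectangle box) throws IOException {
        // Clip the requested box to the page so getSubimage cannot go out of bounds.
        Rectangle pageBounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle clipped = box.intersection(pageBounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Bounding box does not overlap the page");
        }

        // getSubimage returns a view onto the same pixel data; no copy is made here.
        BufferedImage region =
                page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);

        // Encode only the cropped region.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(region, "png", out);
        return out.toByteArray();
    }
}
```

The returned bytes could then be handed to Tika like any other image file, assuming Tesseract is installed so that image OCR is active. A PDF page would first have to be rendered to an image, which is outside the scope of this sketch.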
