Posted to user@tika.apache.org by Furkan KAMACI <fu...@gmail.com> on 2019/11/25 12:39:24 UTC

Token Coordinates at Image

Hi All,

I want to black out particular pieces of text in an image (similar to what is
described here:
https://helpx.adobe.com/acrobat/using/removing-sensitive-content-pdfs.html).

I know that I can find tokens in an image via Tika. However, I need the
coordinates of each found token in the image so that I can black out specific
text automatically.

How can I achieve this?

Kind Regards,
Furkan KAMACI

Re: Token Coordinates at Image

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I could imagine a workflow where you take the original PDF (whether it contains electronic text or only an image) and run it through Tika with Tesseract.  You then get back the tokens with their bounding boxes and use that data to locate your text.  Then go back to the image version of the PDF (which Tika creates as well) and overlay the black boxes…

Your output would be either an image or an image-only PDF…
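
The overlay step could be sketched roughly as follows. This is only an illustration, not anything Tika provides: it assumes the word boxes have already been extracted as (x0, y0, x1, y1) pixel tuples, and it stands in a plain grayscale pixel grid (a list of rows) for the real page image. The names `boxes_to_redact` and `black_out` are hypothetical helpers made up for this sketch.

```python
from typing import Iterable, List, Set, Tuple

# (x0, y0, x1, y1) in pixels, top-left origin; x1/y1 treated as exclusive here.
Box = Tuple[int, int, int, int]


def boxes_to_redact(words: Iterable[Tuple[str, Box]], sensitive: Set[str]) -> List[Box]:
    """Return the boxes of words whose text matches a sensitive token (case-insensitive)."""
    wanted = {s.lower() for s in sensitive}
    return [box for text, box in words if text.lower() in wanted]


def black_out(pixels: List[List[int]], boxes: Iterable[Box], pad: int = 0) -> None:
    """Overwrite every pixel inside each box with 0 (black), in place."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for x0, y0, x1, y1 in boxes:
        # Clamp the (optionally padded) box to the image bounds.
        left, top = max(0, x0 - pad), max(0, y0 - pad)
        right, bottom = min(width, x1 + pad), min(height, y1 + pad)
        for y in range(top, bottom):
            for x in range(left, right):
                pixels[y][x] = 0
```

In a real pipeline the pixel grid would be replaced by whatever imaging library you already use to draw filled rectangles. Because the pixels themselves are overwritten, the result is a flattened image rather than a box layered on top of recoverable content.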

I think, contrary to what Tim said, we actually get the hOCR coordinates for both image-only PDFs and PDFs with underlying electronic text, since the PDFs are converted page by page to images before Tesseract gets them, IIUC.

Eric

> On Nov 25, 2019, at 10:02 AM, Tim Allison <ta...@apache.org> wrote:
> 
> Hi Furkan,
> 
>   First, are you processing PDFs or actual image files?  If PDFs, be careful about blacking out images, because the file may still contain a record of the underlying text: a user might not be able to see the sensitive information, but it may remain available to inquiring minds.
> 
>   If PDFs, are they image-only, or is there underlying electronic text?  If image-only, you could use the hOCR output from Tesseract, which reports coordinates in an HTML output file.
> 
>   Now, if there is underlying text, we aren't currently extracting text positions from PDFs...although we could.  
> 
> @Eric Pugh <ma...@opensourceconnections.com>, recommendations?
> 
>   Cheers,
> 
>                       Tim
> 


Re: Token Coordinates at Image

Posted by Tim Allison <ta...@apache.org>.

Hi Furkan,

  First, are you processing PDFs or actual image files?  If PDFs, be
careful about blacking out images, because the file may still contain a
record of the underlying text: a user might not be able to see the
sensitive information, but it may remain available to inquiring minds.

  If PDFs, are they image-only, or is there underlying electronic text?
If image-only, you could use the hOCR output from Tesseract, which reports
coordinates in an HTML output file.
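
As a rough illustration of reading those coordinates, here is a minimal sketch using only the Python standard library. It assumes the hOCR file marks each word as a span with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`, as in the hOCR format; the sample markup in the usage note is made up for illustration, and `HocrWords` and `parse_hocr` are hypothetical names, not part of Tika or Tesseract.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from word-level spans in hOCR markup."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Box]] = []
        self._bbox: Optional[Box] = None  # set while inside a word span
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX.search(attr.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr(markup: str) -> List[Tuple[str, Box]]:
    parser = HocrWords()
    parser.feed(markup)
    return parser.words
```

For example, feeding it a line such as `<span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Account</span>` should yield the pair `("Account", (10, 10, 90, 40))`. Real output may differ in detail between Tesseract versions, so check it against your own files.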

  Now, if there is underlying text, we aren't currently extracting text
positions from PDFs...although we could.

@Eric Pugh <ep...@opensourceconnections.com>, recommendations?

  Cheers,

                      Tim
