
Posted to dev@tika.apache.org by Nilton Monteiro <al...@hotmail.com> on 2021/01/18 12:08:20 UTC

Config Tika Server

Hello, I would like to know if it's possible to extract the position of the text, tables, graphs, and pages in PPT files.
I tried Tika-python to parse the PPT file, but I did not find any options for getting this information.
I understand that I need to configure Tika Server to obtain that. Could you please help me with that?

Thanks,
Nilton

Re: Config Tika Server

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I’ve done two projects around this.

https://github.com/o19s/powerpoint-discovery-demo demonstrates hOCR plus converting PPTs to static images for web-friendlier (sorta!) highlighting in context.

https://github.com/o19s/pdf-discovery-demo/ is similar but newer, and does the same thing for PDFs; however, we use pdf.js to render the PDF natively in the browser.
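
To make "highlighting in context" on a static slide image a bit more concrete, here is a minimal Python sketch. It is not taken from either demo repository; the function name to_percent_box and the choice of percentage-based overlay coordinates are assumptions for illustration. It converts a pixel bounding box (x0, y0, x1, y1), such as the ones hOCR reports, into percentages of the image size, so an overlay stays aligned when the image is scaled in the page.

```python
def to_percent_box(bbox, image_width, image_height):
    """Convert a pixel bbox (x0, y0, x1, y1) into percentages of the image size.

    Hypothetical helper: the returned keys mirror CSS-style positioning
    (left/top/width/height), which is an assumption about how the overlay
    would be drawn, not something the demos are known to do.
    """
    x0, y0, x1, y1 = bbox
    return {
        "left": 100 * x0 / image_width,
        "top": 100 * y0 / image_height,
        "width": 100 * (x1 - x0) / image_width,
        "height": 100 * (y1 - y0) / image_height,
    }


# Example: a word box on a slide rendered at 1280x720 pixels.
box = to_percent_box((320, 180, 640, 360), 1280, 720)
print(box)
```

Working in percentages rather than pixels means the same box can be reused whatever size the slide image is displayed at.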

Eric

> On Jan 18, 2021, at 8:48 AM, Tim Allison <ta...@apache.org> wrote:
> 
> We aren’t currently extracting position in any format. I _think_ it is
> straightforward to get coordinates from PDFs, but I’d have to look at the
> PPT/PPTX APIs for location.
> 
> What, specifically, are you trying to accomplish?
> 
> Tesseract in hOCR mode does extract coordinates, if that’s of any use...


Re: Config Tika Server

Posted by Tim Allison <ta...@apache.org>.

We aren’t currently extracting position in any format. I _think_ it is
straightforward to get coordinates from PDFs, but I’d have to look at the
PPT/PPTX APIs for location.

What, specifically, are you trying to accomplish?

Tesseract in hOCR mode does extract coordinates, if that’s of any use...
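
For anyone who wants to see what those hOCR coordinates look like in practice, here is a rough Python sketch that pulls word-level bounding boxes out of hOCR markup using only the standard library. The class and function names (HocrWordParser, parse_hocr_words) and the SAMPLE fragment are made up for illustration; the sketch assumes the usual hOCR convention of ocrx_word spans carrying a "bbox x0 y0 x1 y1" entry in their title attribute, and it does not cover how to get Tika or Tesseract to produce the hOCR in the first place.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Hand-written hOCR fragment for illustration only (not real OCR output).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1280 720'>
  <span class='ocr_line' title='bbox 100 50 420 90'>
    <span class='ocrx_word' title='bbox 100 50 250 90; x_wconf 96'>Quarterly</span>
    <span class='ocrx_word' title='bbox 270 50 420 90; x_wconf 93'>Results</span>
  </span>
</div>
"""

for word, bbox in parse_hocr_words(SAMPLE):
    print(word, bbox)
```

The boxes are in pixels of the image that was OCRed, so for a PPT they describe where text landed on the rendered slide image, not the shape geometry stored in the file itself.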
