Posted to dev@tika.apache.org by Nilton Monteiro <al...@hotmail.com> on 2021/01/18 12:08:20 UTC
Config Tika Server
Hello, I would like to know if it's possible to extract the positions of the text, tables, graphs, and pages in PPT files.
I tried Tika-python to parse the PPT file, but I did not find any options for getting this information.
I understand that I need to configure Tika Server to obtain it. Could you please help me with that?
Thanks,
Nilton
Re: Config Tika Server
Posted by Eric Pugh <ep...@opensourceconnections.com>.
I’ve done two projects around this.
https://github.com/o19s/powerpoint-discovery-demo demonstrates hOCR plus converting PPTs to static images for web-friendlier (sorta!) highlighting in context.
https://github.com/o19s/pdf-discovery-demo/ is similar but newer, and does the same thing for PDFs; however, we use pdf.js to render the PDF natively in the browser.
Eric
> On Jan 18, 2021, at 8:48 AM, Tim Allison <ta...@apache.org> wrote:
>
> We aren’t currently extracting position in any format. I _think_ it is
> straightforward to get coordinates from PDFs, but I’d have to look at the
> PPT/PPTX APIs for location.
>
> What, specifically, are you trying to accomplish?
>
> Tesseract in hocr mode does extract coordinates if that’s of any use...
Re: Config Tika Server
Posted by Tim Allison <ta...@apache.org>.
We aren’t currently extracting position in any format. I _think_ it is
straightforward to get coordinates from PDFs, but I’d have to look at the
PPT/PPTX APIs for location.
What, specifically, are you trying to accomplish?
Tesseract in hocr mode does extract coordinates if that’s of any use...
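
For readers who go the hOCR route, here is a minimal sketch of reading word coordinates out of hOCR markup in Python, using only the standard library. It assumes the usual hOCR conventions (an `ocrx_word` class and a `bbox x0 y0 x1 y1` entry in the `title` attribute). The sample markup, the `HocrWordParser` class, and the `extract_word_boxes` helper are made up for illustration and are not part of Tika or Tesseract. Note that the boxes are pixel coordinates in the rendered image, not positions in the original slide.

```python
import re
from html.parser import HTMLParser

# hOCR keeps geometry in the title attribute, e.g. "bbox 36 92 180 130; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collects (text, (x0, y0, x1, y1)) per ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_word_boxes(hocr):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample in the shape Tesseract's hOCR output normally takes.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1024 768'>
  <span class='ocr_line' title='bbox 36 92 330 130'>
    <span class='ocrx_word' title='bbox 36 92 180 130; x_wconf 96'>Quarterly</span>
    <span class='ocrx_word' title='bbox 196 92 330 130; x_wconf 93'>Results</span>
  </span>
</div>
"""

for text, box in extract_word_boxes(SAMPLE_HOCR):
    print(text, box)
```

The same approach extends to `ocr_line` or `ocr_par` elements by changing the class being matched, which gives coarser boxes that are often more useful for highlighting.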