Posted to dev@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2022/07/18 09:04:07 UTC

from pdf to some sort of XMLish ODT kind of file ...

 It is in its name: https://en.wikipedia.org/wiki/PDF
 but, as a corpus researcher, I have always wondered what exactly the
"portable", "document" and "format" aspects of it are.  PDF is just a
"visually appealing" GUI.

 Converting the different kinds of PDFs to text is not exactly
straightforward; it is way too entropic (too much of the "information"
needed for the conversion is lost). Some pdf files are image-based (no
text at all), some are image-based but include (some of) the text, some
of the image-based pdf files also contain images, ...

 Do you know of any kind of prior art studying and/or explaining
possible solutions to these kinds of pdf to xmlish text conversion
problems? Any suggestion about how you would approach a solution to
them?

 Thank you,
 lbrtchx

Re: from pdf to some sort of XMLish ODT kind of file ...

Posted by ti...@cid.is.

Albrecht,

thanks for this perfect description: "PDF is just a 'visually
appealing' GUI."
We laughed heartily.

The name "Portable Document Format" is also incorrect in other respects:
PDF is a "Portable *Pages* Format" because the page is the basis.
This is explained by the original purpose of PDF being a "data format" for
prepress, and prepress is all about pages.
For our extensive "PDF to Solr" project we are now going a different way.
We prepare the PDFs of our "data suppliers" with a very good commercial
Windows software package so that, for each page, we get a separate PDF
file and a good text file. "Good text file" means that the OCR pays only
minimal attention to the page formatting in the PDF file (paragraphs,
boxes) and makes the text really usable with the help of dictionaries and
perhaps some magic.
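
The per-page layout described above (one PDF file and one text file per
page) could be turned into one search document per page roughly as
follows. This is only a minimal sketch under assumptions: the file
naming (a .txt next to each .pdf with the same base name), the function
name build_page_docs and the field names "id", "page_pdf" and "text" are
all hypothetical and are not taken from the project or from any Solr
schema.

```python
import json
from pathlib import Path
import tempfile


def build_page_docs(page_dir):
    """Pair each page PDF with its text file; return one dict per page.

    Assumes (hypothetically) that page files share a base name, e.g.
    report_p0001.pdf and report_p0001.txt.
    """
    docs = []
    for pdf_path in sorted(Path(page_dir).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        if not txt_path.exists():
            # No text file was produced for this page; skip it.
            continue
        docs.append({
            "id": pdf_path.stem,          # hypothetical field names
            "page_pdf": pdf_path.name,
            "text": txt_path.read_text(encoding="utf-8").strip(),
        })
    return docs


if __name__ == "__main__":
    # Small demonstration with placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        page_dir = Path(tmp)
        (page_dir / "report_p0001.pdf").write_bytes(b"%PDF-1.4")
        (page_dir / "report_p0001.txt").write_text(
            "Text of page one.\n", encoding="utf-8")
        print(json.dumps(build_page_docs(page_dir), indent=2))
```

The resulting list serializes to a JSON array of documents, which could
then be sent to a search index; how the fields map onto an actual index
schema is left open here.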

Best
Walter Claassen
cla@cid.is

PS Your first and last names sound German. Me too.




"Albretch Mueller" <lb...@gmail.com> wrote on 18.07.2022 11:04:07:

> From: "Albretch Mueller" <lb...@gmail.com>
> To: user@tika.apache.org, dev@tika.apache.org
> Date: 18.07.2022 11:05
> Subject: from pdf to some sort of XMLish ODT kind of file ...
> 
>  It is in its name: https://en.wikipedia.org/wiki/PDF
>  but, as a corpus researcher, I have always wondered what exactly the
> "portable", "document" and "format" aspects of it are.  PDF is just a
> "visually appealing" GUI.
> 
>  Converting the different kinds of PDFs to text is not exactly
> straightforward; it is way too entropic (too much of the "information"
> needed for the conversion is lost). Some pdf files are image-based (no
> text at all), some are image-based but include (some of) the text, some
> of the image-based pdf files also contain images, ...
> 
>  Do you know of any kind of prior art studying and/or explaining
> possible solutions to these kinds of pdf to xmlish text conversion
> problems? Any suggestion about how you would approach a solution to
> them?
> 
>  Thank you,
>  lbrtchx