Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2022/10/02 20:14:10 UTC

ghostscript's -dFILTER options ...

I learned of Ghostscript's -dFILTER options (-dFILTERIMAGE, -dFILTERTEXT, -dFILTERVECTOR) from:

https://askubuntu.com/questions/477663/how-to-remove-images-from-a-pdf-file

and tested them on what I consider to be more real-life cases:

https://nysl.ptfs.com/data/Library1/Library1/pdf/39007765_US-History-and-Government-Russian-Edition_2004-JAN-28.pdf

https://nysl.ptfs.com/data/Library1/Library1/pdf/39007765_EARTH-SCIENCE-REFERENCE-TABLE-CHINESE-2010.pdf

https://download.archive.org/byte-magazine-1986-02/1986_02_BYTE_11-02_Text_Processing.pdf

https://nysl.ptfs.com/data/Library1/Library1/pdf/7590547_Physical-Geography-Nov-1884.pdf

https://nysl.ptfs.com/data/Library1/Library1/pdf/7590547_Astronomy-Mar-14-1894.pdf

Depending on your expectations, it may not really work that well: the
output file still needs heavy "human" eyeballing.
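
For anyone who wants to repeat the test, here is a minimal sketch of the
kind of invocation the askubuntu answer describes, wrapped in Python. The
helper names (build_gs_command, strip_content) and the file names are my
own placeholders, and it assumes a Ghostscript build recent enough to
support the -dFILTER* switches and installed as "gs" on the PATH:

```python
import shutil
import subprocess

# Ghostscript switches that each drop one class of page content.
FILTERS = {
    "image": "-dFILTERIMAGE",
    "text": "-dFILTERTEXT",
    "vector": "-dFILTERVECTOR",
}


def build_gs_command(src, dst, drop=("image",), gs="gs"):
    """Return the argument list for a pdfwrite run that drops the given content classes."""
    flags = [FILTERS[kind] for kind in drop]  # raises KeyError for an unknown kind
    # -o sets the output file and also implies -dBATCH -dNOPAUSE.
    return [gs, "-o", str(dst), "-sDEVICE=pdfwrite", *flags, str(src)]


def strip_content(src, dst, drop=("image",)):
    """Run Ghostscript (assumed to be on PATH as 'gs'); raises if it is missing or fails."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("gs executable not found on PATH")
    subprocess.run(build_gs_command(src, dst, drop), check=True)


if __name__ == "__main__":
    # Only prints the command; call strip_content() to actually run it.
    print(build_gs_command("in.pdf", "no-images.pdf"))
```

Keeping the command construction separate from the subprocess call makes
it easy to check what would be run before touching any file.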

Do those options figure out where an image might be from the pixel
arrangement on the page layout, or do they actually work from the
page's readily available metadata?

PDF files, contrary to what their name suggests, are neither
portable nor documents. There is also a plethora of PDF file types,
from page-by-page quasi-textual ones to image-based ones, image-based
ones containing the actual text, ...

If we can't hope to textualize PDF files fully automatically, why not
design GUIs to help "humans" pick up where the algorithms end? To me
that would be a much-needed Tika subproject: a Java-based "file
cleansing/reformatting" GUI.
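
To make the hand-off idea a bit more concrete, here is a rough sketch of
how such a tool might decide which pages to put in front of a person. It
is purely illustrative: the function name, the min_chars threshold and
the "too little extracted text" heuristic are my own assumptions, and the
per-page text is assumed to come from whatever extractor is in use:

```python
def pages_needing_review(page_texts, min_chars=200):
    """Return 1-based page numbers whose extracted text looks too thin to trust."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters, so blank or layout-only
        # output from the extractor does not pass as real text.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    # Hypothetical extractor output for a three-page file.
    sample = ["x" * 500, "   \n  ", "abc"]
    print(pages_needing_review(sample, min_chars=10))
```

A GUI would then only need to show the flagged pages next to the original
page image, instead of making someone eyeball the whole file.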

lbrtchx