Posted to users@pdfbox.apache.org by Paul Bergstrom <pa...@yahoo.com.INVALID> on 2016/07/06 11:22:57 UTC

PDF to PDF/A conversion

Hi!

I'm totally new to Apache PDFBox, so please bear with any stupid questions :-)

In my work I do digital archiving, where I usually OCR scanned PDF images with Tesseract and then convert the PDF to PDF/A-1b with Ghostscript.

However, there has recently been a change in the OCR specifications (I don't really know when or exactly how), and the consequence is that Ghostscript now mangles and alters the OCR text so that it can't be used. As far as I understand, it has something to do with the ToUnicode CMap processing.

I tried some other software for the conversion, and the problem does not occur there. That's why I would also like to try the conversion with PDFBox to see what happens.

The problem is that I have absolutely no idea how to do this. I'm not really into Java-based software. Can it be done, and if so, how? Preferably from the Linux command line.

I saw this https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html but I can't make any sense of it.

Is something like this possible:

java -jar pdfbox-app-x.y.z.jar Convert [OPTIONS] <inputfile> [outputfile] (where the options might be the compatibility level)?

Many thanks for your effort!

Best regards

Paul Bergström
Sweden

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: PDF to PDF/A conversion

Posted by Tilman Hausherr <TH...@t-online.de>.

On 06.07.2016 at 13:22, Paul Bergstrom wrote:
> Is something like this possible:
>
> java -jar pdfbox-app-x.y.z.jar Convert [OPTIONS] <inputfile> [outputfile] (where the options might be the compatibility level)?
We don't have a tool that converts PDF to PDF/A-1b (there are commercial
tools that do that, e.g. from Callas or PDF-Tools). It might be possible
to implement this if the flaws are known in advance. Usually the
metadata and the output intent are missing, but there may be many more
problems.
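
To make the metadata part a bit more concrete, here is a minimal sketch of the XMP packet that identifies a file as PDF/A-1b. It uses only the Java standard library and is not PDFBox API; the class name PdfaXmpSketch and the method buildPdfaIdPacket are hypothetical names chosen for illustration. It only builds the packet text. Attaching it to the document, adding the output intent (an ICC profile) and fixing any other flaws would still be needed, so this is an illustration and not a converter.

```java
public class PdfaXmpSketch {

    // Hypothetical helper: returns the text of an XMP packet that carries the
    // PDF/A identification properties (pdfaid:part and pdfaid:conformance).
    static String buildPdfaIdPacket(int part, String conformance) {
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
             + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
             + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
             + "  <rdf:Description rdf:about=\"\"\n"
             + "      xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">\n"
             + "   <pdfaid:part>" + part + "</pdfaid:part>\n"
             + "   <pdfaid:conformance>" + conformance + "</pdfaid:conformance>\n"
             + "  </rdf:Description>\n"
             + " </rdf:RDF>\n"
             + "</x:xmpmeta>\n"
             + "<?xpacket end=\"w\"?>";
    }

    public static void main(String[] args) {
        // Part 1, conformance level B corresponds to PDF/A-1b.
        System.out.println(buildPdfaIdPacket(1, "B"));
    }
}
```

A file that lacks this identification in its metadata will not be recognized as PDF/A by a validator, whatever else has been fixed.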

Tilman

Re: PDF to PDF/A conversion

Posted by Mick Davis <tg...@gmail.com>.

Hi Paul,

I just finished doing exactly that for the same reason (archiving
documents). I wrote a Java program that can be called from the command
line. It would probably need to be tweaked to suit your exact needs, but I
would be happy to share it and help with customizing it. I'm away from my
desk right now, but I will follow up with you later today.

Mick Davis