Posted to users@pdfbox.apache.org by Peter Murray-Rust <pm...@cam.ac.uk> on 2014/04/22 15:39:00 UTC

OCR and PDFBox/PDF2SVG

We need to carry out limited OCR in the PDF extraction process and
are thinking of adding it to PDF2SVG (
https://bitbucket.org/petermr/pdf2svg-dev/wiki/Home - our converter based
on PDFBox). In our work (converting technical documents and scientific
publications) there are two particular areas where this is needed:

* when unknown and non-conformant font families are used. This is
unfortunately extremely common (most scientific publishers use undocumented,
non-Unicode fonts). Our approach is to carry out "OCR" on the glyphs in
the font maps.

* in binarized image diagrams (e.g. plots), where characters in a (fairly
small) range of fonts are used (code points mainly in the ASCII range).
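
As a rough illustration of matching a binarized glyph against a small set
of known characters, the sketch below picks the reference bitmap with the
fewest differing pixels (Hamming distance). It is not the PDF2SVG code: the
class GlyphMatcher, its methods and the tiny 3x3 templates are hypothetical,
and it assumes every glyph has already been scaled to the template size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class GlyphMatcher {

    /** Counts the pixels at which two equally sized bitmaps differ. */
    static int hammingDistance(boolean[][] a, boolean[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bitmaps must have the same height");
        }
        int diff = 0;
        for (int y = 0; y < a.length; y++) {
            if (a[y].length != b[y].length) {
                throw new IllegalArgumentException("bitmaps must have the same width");
            }
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x] != b[y][x]) {
                    diff++;
                }
            }
        }
        return diff;
    }

    /** Returns the character whose template is closest to the glyph, or '?' if there are no templates. */
    static char bestMatch(boolean[][] glyph, Map<Character, boolean[][]> templates) {
        char best = '?';
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Character, boolean[][]> entry : templates.entrySet()) {
            int distance = hammingDistance(glyph, entry.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Builds a bitmap from text rows; '#' is a foreground pixel, anything else is background. */
    static boolean[][] parse(String... rows) {
        boolean[][] bitmap = new boolean[rows.length][];
        for (int y = 0; y < rows.length; y++) {
            bitmap[y] = new boolean[rows[y].length()];
            for (int x = 0; x < rows[y].length(); x++) {
                bitmap[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return bitmap;
    }

    public static void main(String[] args) {
        Map<Character, boolean[][]> templates = new LinkedHashMap<>();
        templates.put('I', parse(".#.", ".#.", ".#."));
        templates.put('L', parse("#..", "#..", "###"));
        // A slightly damaged 'I' with one extra pixel.
        System.out.println(bestMatch(parse(".#.", ".#.", "##."), templates));
    }
}
```

Pixel-wise distance is only workable because the input is noise-free and the
font range is small; a real matcher would also need to normalise glyph size
and position first.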

There seems to be no pure-Java F/OSS OCR software that can easily be used
with PDFBox and PDF2SVG. We are therefore hacking our own, using bits of
"javaOCR" (http://sourceforge.net/projects/javaocr/ - which seems to be a
stalled project) and Longan (https://github.com/Zarkonnen/Longan/tree/master/src -
the author has recently mailed me and is interested in resuming the work).
We also have our own approach, which involves thinning and topological
analysis.
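
For readers unfamiliar with thinning, here is a minimal sketch of one
well-known method, the Zhang-Suen algorithm, applied to a boolean bitmap.
The thread does not say which thinning method is actually used, so this is
an assumption chosen purely for illustration; the class name Thinning and
its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of Zhang-Suen thinning on a binary bitmap.
final class Thinning {

    // Neighbour offsets in the order N, NE, E, SE, S, SW, W, NW.
    private static final int[] DY = {-1, -1, 0, 1, 1, 1, 0, -1};
    private static final int[] DX = {0, 1, 1, 1, 0, -1, -1, -1};

    /** Returns a thinned copy of the input; true means a foreground pixel. */
    static boolean[][] thin(boolean[][] input) {
        boolean[][] img = new boolean[input.length][];
        for (int y = 0; y < input.length; y++) {
            img[y] = input[y].clone();
        }
        boolean changed;
        do {
            // Non-short-circuit '|' so that both sub-iterations always run.
            changed = pass(img, true) | pass(img, false);
        } while (changed);
        return img;
    }

    /** Runs one sub-iteration and reports whether any pixel was removed. */
    private static boolean pass(boolean[][] img, boolean first) {
        List<int[]> toClear = new ArrayList<>();
        for (int y = 0; y < img.length; y++) {
            for (int x = 0; x < img[y].length; x++) {
                if (!img[y][x]) {
                    continue;
                }
                boolean[] p = new boolean[8];
                int neighbours = 0;
                for (int i = 0; i < 8; i++) {
                    p[i] = get(img, y + DY[i], x + DX[i]);
                    if (p[i]) {
                        neighbours++;
                    }
                }
                // Count background-to-foreground transitions around the pixel.
                int transitions = 0;
                for (int i = 0; i < 8; i++) {
                    if (!p[i] && p[(i + 1) % 8]) {
                        transitions++;
                    }
                }
                if (neighbours < 2 || neighbours > 6 || transitions != 1) {
                    continue;
                }
                boolean n = p[0], e = p[2], s = p[4], w = p[6];
                boolean removable = first
                        ? !(n && e && s) && !(e && s && w)
                        : !(n && e && w) && !(n && s && w);
                if (removable) {
                    toClear.add(new int[] {y, x});
                }
            }
        }
        // Pixels are cleared only after the scan, so each decision sees the same image.
        for (int[] yx : toClear) {
            img[yx[0]][yx[1]] = false;
        }
        return !toClear.isEmpty();
    }

    /** Reads a pixel, treating everything outside the image as background. */
    private static boolean get(boolean[][] img, int y, int x) {
        return y >= 0 && y < img.length && x >= 0 && x < img[y].length && img[y][x];
    }
}
```

The thinned result is what a topological analysis (counting end points,
junctions and loops) would then work on.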

This mail is to see whether others already have a solution (which would save
us going further), or whether anyone is interested in using such a facility.

[Note that this is feasible mainly because the source is born-digital and
binarized (0/1) and so does not suffer from scanning artefacts such as
skewing, contrast, noise, etc.]
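
To make the "binarized (0/1)" assumption concrete, a sketch along these
lines could turn an image into a boolean bitmap and fail loudly if any
pixel is not pure black or white. The class BinaryImageReader is
hypothetical, and treating black as foreground is an assumption; only the
standard java.awt.image.BufferedImage API is used.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, for illustration only.
final class BinaryImageReader {

    /**
     * Converts a strictly black-and-white image to a bitmap in which true
     * marks a black (foreground) pixel. Any other colour is rejected.
     */
    static boolean[][] toBitmap(BufferedImage image) {
        int height = image.getHeight();
        int width = image.getWidth();
        boolean[][] bitmap = new boolean[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y) & 0xFFFFFF; // drop the alpha channel
                if (rgb == 0x000000) {
                    bitmap[y][x] = true;
                } else if (rgb != 0xFFFFFF) {
                    throw new IllegalArgumentException(
                            "pixel (" + x + "," + y + ") is neither black nor white");
                }
            }
        }
        return bitmap;
    }
}
```

Because no thresholding is needed, the bitmap can be passed straight on to
glyph matching or thinning.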

P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: OCR and PDFBox/PDF2SVG

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

Thanks, that could be useful. (Note that it hasn't started yet, and
presumably won't until summer.)

We'll probably bash ahead anyway, as we have other things to do, and will
keep in touch.

On Tue, Apr 22, 2014 at 5:04 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> Hi Peter,
>
> PDFBOX-1912 is an effort to add OCR to PDFBox as part of a GSoC engagement.
>
> Maybe that’s what you are looking for?
>
> BR
> Maruan
>


Re: OCR and PDFBox/PDF2SVG

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Peter,

PDFBOX-1912 is an effort to add OCR to PDFBox as part of a GSoC engagement. 

Maybe that’s what you are looking for?

BR
Maruan
