Posted to users@pdfbox.apache.org by Siva Kumar Ch <si...@gmail.com> on 2014/03/28 19:23:16 UTC

Eliminating super scripts while extracting text from pdf

Hi,

I am trying to extract text from a PDF and then process it. The extraction
itself works, but the result is of limited use because superscripts, usually
numbers, are extracted as normal text.

When the word carrying a superscript is the last word of a sentence, the
superscript is placed after the period (.).

Example: the word "test" with the superscript "super", at the end of a
sentence, is extracted as "test.super".

Is there any way I can get rid of superscripts?

-- 
Br,
Siva.

Re: Eliminating super scripts while extracting text from pdf

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

The second link (on Bitbucket) has a download, so you can download and run
the code. It is all built with Maven. The documentation isn't great. You
should find some example PDF documents with subscripts in there.


On Mon, Mar 31, 2014 at 10:17 PM, Siva Kumar <si...@knoesis.org> wrote:

> Hi Peter,
>
> Thanks.
>
> As suggested, I went through the links provided, but unfortunately
> could not find the heuristics for detecting sub/superscripts.
>
> If possible, please attach the code or provide a publicly accessible link.
>
> Appreciate your help.



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Eliminating super scripts while extracting text from pdf

Posted by Siva Kumar <si...@knoesis.org>.

Hi Peter,

Thanks.

As suggested, I went through the links provided, but unfortunately
could not find the heuristics for detecting sub/superscripts.

If possible, please attach the code or provide a publicly accessible link.

Appreciate your help.


On Sat, Mar 29, 2014 at 4:19 AM, Peter Murray-Rust <pm...@cam.ac.uk> wrote:

> As Olaf says, there is no formal support for sub/superscripts in PDF.
> Generally a smaller font size is used and the characters are raised or
> lowered.
>
> We have written heuristics to detect sub/superscripts in the output of
> PDFBox. See http://bitbucket.org/petermr/ami and
> http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly
> publishing. It is fairly general but may have to be tweaked for other
> applications. I have not commonly found Unicode sub/superscript characters
> being used; the normal approach is a different font size and a shift.

Re: Eliminating super scripts while extracting text from pdf

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

As Olaf says, there is no formal support for sub/superscripts in PDF.
Generally a smaller font size is used and the characters are raised or
lowered.

We have written heuristics to detect sub/superscripts in the output of
PDFBox. See http://bitbucket.org/petermr/ami and
http://bitbucket.org/petermr/svg2xml-dev. This works well for scholarly
publishing. It is fairly general but may have to be tweaked for other
applications. I have not commonly found Unicode sub/superscript characters
being used; the normal approach is a different font size and a shift.
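
A minimal sketch of such a size-and-shift heuristic, in Java. This is not
the code from the repositories linked above; the Glyph class, the thresholds
and the method names are assumptions made for illustration. With PDFBox the
three values could presumably be taken from TextPosition (getUnicode(),
getFontSizeInPt(), and a baseline derived from getYDirAdj(), whose sign
would need flipping if y grows down the page).

```java
import java.util.List;

/** Sketch: drop glyphs that look like superscripts (smaller and raised). */
class SuperscriptHeuristic {

    /** Hypothetical glyph holder for illustration; not a PDFBox class. */
    static final class Glyph {
        final String text;      // the character(s) this glyph maps to
        final double baseline;  // baseline height, measured UPWARD from the page bottom
        final double fontSize;  // font size in points

        Glyph(String text, double baseline, double fontSize) {
            this.text = text;
            this.baseline = baseline;
            this.fontSize = fontSize;
        }
    }

    // Assumed thresholds; they will need tuning for real documents.
    static final double SIZE_RATIO = 0.85; // script glyph is at most 85% of body size
    static final double MIN_SHIFT = 0.15;  // baseline raised by at least 15% of body size

    /** True if g is smaller than the body glyph and its baseline is raised. */
    static boolean isSuperscript(Glyph g, Glyph body) {
        boolean smaller = g.fontSize <= body.fontSize * SIZE_RATIO;
        boolean raised = (g.baseline - body.baseline) >= body.fontSize * MIN_SHIFT;
        return smaller && raised;
    }

    /** Concatenates the glyphs of one line, skipping suspected superscripts. */
    static String stripSuperscripts(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph body = null; // most recent glyph accepted as body text
        for (Glyph g : line) {
            if (body != null && isSuperscript(g, body)) {
                continue; // compared against the last body glyph, so runs are skipped too
            }
            out.append(g.text);
            body = g;
        }
        return out.toString();
    }
}
```

For a line whose glyphs "test." sit at baseline 100 with size 10 and whose
glyphs "super" sit at baseline 104 with size 6, stripSuperscripts returns
"test.". A superscript at the very start of a line is not caught, since
there is no body glyph to compare it against. Subscripts could be handled
the same way with the shift test reversed.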


On Fri, Mar 28, 2014 at 9:47 PM, Olaf Drümmer
<ol...@callassoftware.com> wrote:

> Two thoughts:
>
> - Keep track of the baseline and size of the characters. If the baseline
> is slightly shifted (upwards -> superscript, downwards -> subscript) and
> the size is smaller than that of the surrounding characters, it is
> possibly a superscript or subscript character.
>
> - Be aware that some fonts contain dedicated glyphs for superscripts. In
> that case the baseline and text size are the same as for the surrounding
> text, and you would have to look up via the Unicode code point whether
> you have encountered a superscript.
>
> Olaf


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Eliminating super scripts while extracting text from pdf

Posted by Olaf Drümmer <ol...@callassoftware.com>.

Two thoughts:

- Keep track of the baseline and size of the characters. If the baseline is
slightly shifted (upwards -> superscript, downwards -> subscript) and the
size is smaller than that of the surrounding characters, it is possibly a
superscript or subscript character.

- Be aware that some fonts contain dedicated glyphs for superscripts. In
that case the baseline and text size are the same as for the surrounding
text, and you would have to look up via the Unicode code point whether you
have encountered a superscript.
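
For the second case, a small Java sketch that checks code points directly.
It assumes that only the superscript digits and signs at U+00B9, U+00B2,
U+00B3 and U+2070 to U+207F matter; the class and method names are made up
for illustration, and other superscript-like characters (modifier letters,
for example) are not covered.

```java
/** Sketch: detect and remove dedicated Unicode superscript characters. */
class UnicodeSuperscripts {

    /** True for superscript one, two, three and the range U+2070..U+207F. */
    static boolean isSuperscript(int codePoint) {
        return codePoint == 0x00B9      // superscript one
            || codePoint == 0x00B2      // superscript two
            || codePoint == 0x00B3      // superscript three
            || (codePoint >= 0x2070 && codePoint <= 0x207F);
    }

    /** Returns s with all code points matched by isSuperscript removed. */
    static String strip(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints()
         .filter(cp -> !isSuperscript(cp))
         .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // "test." followed by superscript one and superscript two
        System.out.println(strip("test.\u00B9\u00B2"));
    }
}
```

Subscript characters (U+2080 onwards) are left alone here; a second range
check would handle them if they should be dropped as well.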

Olaf

On 28 Mar 2014 at 19:23, Siva Kumar Ch <si...@gmail.com> wrote:

> Hi,
> 
> I am trying to extract text from a PDF and then process it. The extraction
> itself works, but the result is of limited use because superscripts, usually
> numbers, are extracted as normal text.
> 
> When the word carrying a superscript is the last word of a sentence, the
> superscript is placed after the period (.).
> 
> Example: the word "test" with the superscript "super", at the end of a
> sentence, is extracted as "test.super".
> 
> Is there any way I can get rid of superscripts?
> 
> -- 
> Br,
> Siva.