Posted to users@pdfbox.apache.org by HQS <hq...@gmail.com> on 2014/03/06 18:39:20 UTC

2 questions

Hello all,

1.
Have you ever seen PDFs with this kind of (pseudo) structure:

BT
<character>
Tj
ET

?

That is, the strings are split into characters and there is one text block per character?
That seems ill-formed, doesn't it?

2. As a reminder of my first mail: which PDF standard versions does the library comply with? 1.3 to 1.7?


Thanks and regards

Julien


Re: 2 questions

Posted by Olaf Drümmer <ol...@callassoftware.com>.
How is 
	<character>
written exactly?


AFAICT 

BT
<41>
Tj
ET

and

BT
(A)
Tj
ET


would be equivalent and both valid (assuming the right encoding is in place etc.)
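
To make the equivalence concrete, here is a minimal sketch that decodes both notations to bytes. PdfStringDemo and its two methods are hypothetical helpers written for this illustration, not PDFBox API, and the literal-string decoder deliberately ignores escape sequences and nested parentheses. Mapping the resulting byte to a character still depends on the font's encoding, as noted above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: decodes the two PDF string
// notations for the simple single-byte case discussed above.
final class PdfStringDemo {

    // Decodes a hexadecimal string such as <41>. White space between digits
    // is ignored, and a missing final digit is assumed to be 0.
    static byte[] decodeHex(String token) {
        String digits = token.substring(1, token.length() - 1).replaceAll("\\s", "");
        if (digits.length() % 2 != 0) {
            digits += "0";
        }
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes a literal string such as (A). Escape sequences and nested
    // parentheses are not handled in this sketch.
    static byte[] decodeLiteral(String token) {
        String body = token.substring(1, token.length() - 1);
        return body.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] hex = decodeHex("<41>");
        byte[] lit = decodeLiteral("(A)");
        // Both operands of Tj are the single byte 0x41.
        System.out.println(Arrays.equals(hex, lit));
    }
}
```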

Whether this is efficient is a completely different question. I have seen PDF creators that set every parameter of the whole graphics state for each character…


Olaf



Re: 2 questions

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Julien,

Composing words from individual characters may not be a 100% reliable method. Being able to match against a pattern you are looking for will certainly help. Will it always be 100% accurate? Maybe not. What you could do is try the ExtractText command line tool (http://pdfbox.apache.org/commandline/#extractText) or PDFTextStripper to extract the text from your PDF, then check the results and see whether the words you are looking for come out as whole words.

BR
Maruan Sahyoun

On 07.03.2014 at 12:16, Confidential Confidential <hq...@gmail.com> wrote:

> Sirs,
> 
> I had already thought about this graphical approach to reconstruct the
> words. I gave it up because I'm a bit sceptical about the reliability of
> such a method. I can't help thinking that it will not be a 100% sure
> method. I understand why a CAD software would produce such an output,
> though (thank you for this new word that I didn't know "boustrophedonic",
> but it explains well the result obtained).
> 
> Supposing that the characters appear in a totally arbitrary order,
> detecting that they're on the same line is more or less a piece of cake
> (except if I need to introduce a tolerance, which makes things more
> difficult), but grouping the characters according to their X position is
> not at all an easy task.
> 
> But this is not an issue, my problem is more the fact that this method may
> not be 100% reliable. What do you think ?
> 
> As for the technical part (overloading the processText), it's ok, thanks
> for the advice.
> 
> Best regards
> 
> Julien


Re: 2 questions

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
On Sat, Mar 8, 2014 at 5:23 PM, HQS <hq...@gmail.com> wrote:

> Peter,
>
> I have seen the factor of 1000 you mention on a website dealing with
> PDFBox, so you may well be right.
>

thanks


> I have tried the following condition which, if true, means that the 2
> characters belong to the same word:
>
> leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >=
> rightChar.getX()
>
> I tried with X_TOLERANCE = 0.
>
> space is simply equal to leftChar.getWidthOfSpace(), a method of the
> TextPosition class.
> getWidth() is also a method of that class.
>
> The first results are very satisfying.
>

I think you have to involve leftChar.getFontSize(). When *I* extract
characters the width is not scaled. It's possible you are calling other
methods that scale it...

>
> By the way, is there an « easy » way to delete text from a PDF, apart from
> parsing the tokens and deleting those preceding the « Tj » / « TJ »
> operators? I need this to erase the reference strings that I have detected
> and create a hyperlink at the same location with the same font.
>

I can't comment as I only interpret PDFs, not edit them.

BTW I do not use low level operators like Tj - I let PDFBox do the work of
interpreting.


> Once I have tested the PDF word extractor I will post the source code so
> that we can share our techniques.
> The extractor I'm making is a bit more advanced than the one embedded in
> PDFBox, as it creates a list of pairs (XY position of a word, contents of
> a word) rather than just the list of words.
>

I do this in two stages: first translate all chars to SVG (PDF2SVG), then in a
separate project (SVG2XML) do the character concatenation, where I have to deal
with subscripts, etc. Most PDF2Text tools don't deal with subscripts.


> Thanks all !
>
> Julien
>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: 2 questions

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
The factor of 1000 is defined in the PDF specification; it maps from glyph space to text space. You may want to take a look at sections 9.1 to 9.4 of the ISO 32000 spec.

BR
Maruan Sahyoun


Re: 2 questions

Posted by HQS <hq...@gmail.com>.
Peter,

I have seen the factor of 1000 you mention on a website dealing with PDFBox, so you may well be right.
I have tried the following condition which, if true, means that the 2 characters belong to the same word:

leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >= rightChar.getX()

I tried with X_TOLERANCE = 0.

space is simply equal to leftChar.getWidthOfSpace(), a method of the TextPosition class.
getWidth() is also a method of that class.

The first results are very satisfying.
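
Written out as a self-contained sketch, the condition looks roughly like this. The nested Glyph class is a hypothetical stand-in for the three values read from TextPosition (getX(), getWidth(), getWidthOfSpace()), so the snippet runs without PDFBox; WordJoin and sameWord are names invented for the illustration.

```java
// Sketch of the same-word test described above. Nothing here is PDFBox API:
// Glyph only mimics the values a TextPosition would supply.
final class WordJoin {

    static final class Glyph {
        final float x;            // left edge of the glyph
        final float width;        // advance width, in the same units as x
        final float widthOfSpace; // width of a space in the glyph's font

        Glyph(float x, float width, float widthOfSpace) {
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    static final float X_TOLERANCE = 0f;

    // True when the right glyph starts no further than half a space width
    // (plus the tolerance) beyond the right edge of the left glyph.
    static boolean sameWord(Glyph left, Glyph right) {
        float reach = left.x + left.width + left.widthOfSpace * .5f + X_TOLERANCE;
        return reach >= right.x;
    }

    public static void main(String[] args) {
        Glyph a = new Glyph(100f, 6f, 3f);
        Glyph b = new Glyph(107f, 6f, 3f);
        System.out.println(sameWord(a, b)); // true: 107.5 >= 107
    }
}
```

Half a space width is only a heuristic threshold; a font with unusual spacing may need a different fraction.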

By the way, is there an « easy » way to delete text from a PDF, apart from parsing the tokens
and deleting those preceding the « Tj » / « TJ » operators? I need this to erase the reference strings
that I have detected and create a hyperlink at the same location with the same font.

Once I have tested the PDF word extractor I will post the source code so that we can share our techniques.
The extractor I'm making is a bit more advanced than the one embedded in PDFBox, as it creates a list of
pairs (XY position of a word, contents of a word) rather than just the list of words.

Thanks all !

Julien



Re: 2 questions

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
The width appears to be a ratio, independent of size. It also seems to be
conventionally multiplied by 1000 (I have not found a definition for this -
I have only guessed it).

Thus a character "A" of width=600 and fontSize=10.5 appears to have
pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels
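
The same arithmetic as a small sketch. It assumes a font whose widths are given in glyph space at 1000 units per text-space unit (the usual case outside Type 3 fonts) and ignores horizontal scaling, character spacing and the text matrix, so the result is in unscaled text-space units rather than true device pixels. GlyphWidth and scaledWidth are hypothetical names.

```java
import java.util.Locale;

// Sketch only: converts a glyph-space width to text space at a given font size.
final class GlyphWidth {

    // Glyph widths are expressed in thousandths of a text-space unit,
    // so divide by 1000 and multiply by the font size.
    static double scaledWidth(double glyphWidth, double fontSize) {
        return glyphWidth / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        // The example above: width=600, fontSize=10.5
        System.out.println(String.format(Locale.ROOT, "%.2f", scaledWidth(600, 10.5))); // 6.30
    }
}
```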

I'd be grateful for confirmation or correction...


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: 2 questions

Posted by HQS <hq...@gmail.com>.
Well, I have a clarification to ask of Peter about this formula:

x(a) + width(a)*fontSize(a) + tolerance >= x(b)

What is the difference between « width(a) » and « fontSize(a) »? Is it not enough
to know the width of the character « a » in pixels, as given by the font, to check this condition?

Thanks !



Re: 2 questions

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
If you need further assistance, please let us know.

BR
Maruan Sahyoun

On 07.03.2014 at 18:24, HQS <hq...@gmail.com> wrote:

> Thank you all for those accurate answers.
> I will try the geometrical approach based on the (x, y) coordinates of the characters.
> 
> Best regards,
> 
> Julien
> 
> On 7 March 2014 at 13:25, Peter Murray-Rust <pm...@cam.ac.uk> wrote:
> 
>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>> hqsoftwares@gmail.com> wrote:
>> 
>>> Sirs,
>>> 
>>> I had already thought about this graphical approach to reconstruct the
>>> words. I gave it up because I'm a bit sceptical about the reliability of
>>> such a method. I can't help thinking that it will not be a 100% sure
>>> method. I understand why a CAD software would produce such an output,
>>> though (thank you for this new word that I didn't know "boustrophedonic",
>>> but it explains well the result obtained).
>>> 
>> 
>> It's not as bad as you think. We have re-constructed the text from hundreds
>> of scientific papers (so probably nearly a million words) and found very
>> few problems. The reason we are doing this rather than using PDFBox tools
>> is that scientific (and especially maths) PDFs contain many diacritics, high
>> Unicode points, occasional graphics strokes, variable font size and style,
>> ligatures, non-horizontal text, etc.
>> 
>> For running text it works very well - assuming that the characters announce
>> their widths. Then - roughly - "ab" is a word if
>> 
>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>> 
>> else we can *crudely* estimate the number of intervening spaces (this is
>> very suspect as publishers may elide concatenated spaces).
>> 
>> All standard Fonts (see PDF spec) should announce their widths.
>> Unfortunately scientific publishers use some of the worst constructed fonts
>> in the world and sometimes we have to guess - by surveying a body of
>> character positions and trying to work out spaces and font-type.
>> 
>> 
>>> Supposing that the characters appear in a totally arbitrary order,
>>> detecting that they're on the same line is more or less a piece of cake
>>> (except if I need to introduce a tolerance, which makes things more
>>> difficult),
>> 
>> 
>> In a modern PDF we find that all characters on the same line tend to have
>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>> characters may have variable y because of rounding errors and antialiasing.
>> 
>> 
>> 
>>> but grouping the characters according to their X position is
>>> not at all an easy task.
>>> 
>> 
>> The order should be fairly clear. The problems are:
>> * spaces (see above)
>> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet)
>> - we generally solve > 90%. Hyphens in chemistry are meaningful
>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>> acute). These can occur in variable order. Where possible we try to
>> recreate a single Unicode point.
>> * over and underbars
>> * ligatures (in "waffle") there may be 6 characters or only 4 w-a-ffl-e. We
>> split the latter.
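
For the diacritics and ligature items, the Unicode normalization forms in the Java standard library cover the simple cases. This is only a sketch, not the method used in the project quoted above: it assumes the accent has already been mapped to a combining mark placed after its base letter, and that the ligature arrives as a single compatibility code point. GlyphText and its methods are hypothetical names.

```java
import java.text.Normalizer;

// Sketch: post-processing of extracted text with java.text.Normalizer.
final class GlyphText {

    // Recombines a base letter and a following combining mark into one code
    // point where Unicode defines one (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Expands compatibility characters such as the "ffl" ligature (NFKC).
    // Caution: NFKC also folds superscript and subscript digits to plain
    // digits, which may be unwanted in scientific text.
    static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(compose("E\u0301").length());    // 1: a single code point, U+00C9
        System.out.println(expandLigatures("wa\uFB04e"));   // waffle
    }
}
```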
>> 
>> 
>>> 
>>> But this is not an issue, my problem is more the fact that this method may
>>> not be 100% reliable. What do you think ?
>>> 
>> 
>> We are committed to solving it for English-language science and European
>> personal names. The worst case is probably slanted text in diagrams.
>> 
>> 
>>> 
>>> As for the technical part (overloading the processText), it's ok, thanks
>>> for the advice.
>>> 
>>> Best regards
>>> 
>>> Julien
>>> 
>>> 
>>> 
>>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
> 


Re: 2 questions

Posted by HQS <hq...@gmail.com>.
Thank you all for these precise answers.
I will try the geometric approach based on the (x, y) coordinates of the characters.

Best regards,

Julien



Re: 2 questions

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
hqsoftwares@gmail.com> wrote:

> Sirs,
>
> I had already thought about this graphical approach to reconstruct the
> words. I've let it down because I'm a bit sceptical on the reliability of
> such a method. I can't help thinking that it will not be a 100% sure
> method. I understand why a CAD software would produce such an output,
> though (thank you for this new word that I didn't know "boustrophedonic",
> but it explains well the result obtained).
>

It's not as bad as you think. We have re-constructed the text from hundreds
of scientific papers (so probably nearly a million words) and found very
few problems. The reason we are doing this rather than using PDFBox tools
is that scientific (and especially maths) PDFs contain many diacritics, high
Unicode points, occasional graphics strokes, variable font size and style,
ligatures, non-horizontal text, etc.

For running text it works very well - assuming that the characters announce
their widths. Then - roughly - "ab" is a word if

x(a) + width(a)*fontSize(a) + tolerance >= x(b)

else we can *crudely* estimate the number of intervening spaces (this is
very suspect as publishers may elide concatenated spaces).
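
A minimal sketch of the rule above, not AMI2's actual code. The Glyph
holder and its field names are assumptions made for illustration: x is
the left edge of the character and width is expressed as a fraction of
the font size, so width * fontSize is the advance in page units. The
glyphs are assumed to be on one line and already sorted by x.

```java
import java.util.ArrayList;
import java.util.List;

final class WordJoin {
    /** Hypothetical glyph holder; width is a fraction of the font size. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;
        final double fontSize;

        Glyph(String text, double x, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    /** The rule from the text: x(a) + width(a)*fontSize(a) + tolerance >= x(b). */
    static boolean sameWord(Glyph a, Glyph b, double tolerance) {
        return a.x + a.width * a.fontSize + tolerance >= b.x;
    }

    /** Joins the glyphs of one line (sorted by x) into words. */
    static List<String> words(List<Glyph> line, double tolerance) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            // A gap wider than the tolerance closes the current word.
            if (previous != null && !sameWord(previous, g, tolerance)) {
                out.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

The tolerance is in page units and has to be tuned per document; a
fraction of the font size is probably a more robust choice than a
constant.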

All standard Fonts (see PDF spec) should announce their widths.
Unfortunately scientific publishers use some of the worst constructed fonts
in the world and sometimes we have to guess - by surveying a body of
character positions and trying to work out spaces and font-type.


> Supposing that the characters appear in a totally arbitrary order,
> detecting that they're on the same line is more or less piece of cake
> (except if I need to introduce a tolerance, which makes things more
> difficult),


In a modern PDF we find that all characters on the same line tend to have
equal y-coords to at least 3 decimals. The problem is that OCR'ed
characters may have variable y because of rounding errors and antialiasing.
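
As a rough illustration of grouping by y with a tolerance (the Glyph
holder and the method name are hypothetical, and this is only one of
several possible clustering strategies): sort by y, then start a new
line whenever the jump from the previous glyph exceeds the tolerance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LineGrouper {
    /** Hypothetical glyph holder with page coordinates. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into lines. Lines are returned in ascending y order
     * and each line is sorted by ascending x.
     */
    static List<List<Glyph>> groupIntoLines(List<Glyph> glyphs, double yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            // Compare with the previous glyph in y order, so small drifts
            // (e.g. from OCR) chain together into one line.
            if (!current.isEmpty()
                    && Math.abs(g.y - current.get(current.size() - 1).y) > yTolerance) {
                lines.add(current);
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        return lines;
    }
}
```

For born-digital PDFs a tolerance around 0.001 matches the "equal to 3
decimals" observation; OCR'ed text would need something much larger,
probably scaled to the font size.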



> but grouping the characters according to their X position is
> not at all an easy task.
>

The order should be fairly clear. The problems are:
* spaces (see above)
* hyphens at line-end (this requires heuristics - maybe lookup in WordNet);
  we generally solve > 90%. Hyphens in chemistry are meaningful
* diacritics. Some characters have diacritics with the same x (e.g. E and
  acute). These can occur in variable order. Where possible we try to
  recreate a single Unicode point.
* over- and underbars
* ligatures: in "waffle" there may be 6 characters or only 4 (w-a-ffl-e). We
  split the latter.
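
The diacritic and ligature items in the list above can be approximated
with the JDK's java.text.Normalizer. This is only a sketch under
assumptions, not what AMI2 does: the class and method names are made up,
the ligature is assumed to arrive as a Unicode presentation form such as
U+FB04, and the diacritic is assumed to have been mapped to a combining
mark (U+0301) and placed after its base letter already.

```java
import java.text.Normalizer;

final class GlyphNormalizer {
    /**
     * Splits compatibility ligatures (e.g. U+FB04, the "ffl" ligature)
     * into their letters via NFKC. Note that NFKC also folds other
     * compatibility characters such as superscript digits, which may be
     * unwanted in chemistry.
     */
    static String splitLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /**
     * Combines a base letter and a following combining mark into a single
     * code point where Unicode defines a precomposed form (NFC).
     */
    static String compose(String base, String combiningMark) {
        return Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
    }
}
```

A PDF that draws the accent as a spacing character (U+00B4) rather than
a combining one would need an extra mapping step before compose() has
any effect.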


>
> But this is not an issue, my problem is more the fact that this method may
> not be 100% reliable. What do you think ?
>

We are committed to solving it for English-language science and European
personal names. The worst case is probably slanted text in diagrams.


>
> As for the technical part (overloading the processText), it's ok, thanks
> for the advice.
>
> Best regards
>
> Julien
>
>
>
> --
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: 2 questions

Posted by Confidential Confidential <hq...@gmail.com>.
Sirs,

I had already thought about this graphical approach to reconstructing the
words. I gave up on it because I'm a bit sceptical about the reliability of
such a method; I can't help thinking that it will not be 100% reliable.
I do understand why CAD software would produce such output, though (thank
you for the new word "boustrophedonic", which I didn't know; it explains
the result well).

Supposing that the characters appear in a totally arbitrary order,
detecting that they're on the same line is more or less a piece of cake
(unless I need to introduce a tolerance, which makes things more
difficult), but grouping the characters according to their X position is
not an easy task at all.

But that is not the real issue; my problem is rather that this method may
not be 100% reliable. What do you think?

As for the technical part (overriding processTextPosition), it's OK, thanks
for the advice.

Best regards

Julien




Re: 2 questions

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
Agreed.
In our reconstruction of the scientific content of technical documents from
PDF (AMI2, http://www.bitbucket.org/petermr/ami2 ) we throw away all
character groupings from the PDF and render each character to SVG with its
coordinates and other attributes (stroke, font-size, etc.). This is because
there is no consistency in how PDF tools create character groupings: they
are often split at kerning points rather than whitespace, and they may be
boustrophedonic (http://dictionary.reference.com/browse/boustrophedonic ).
The only reliable strategy is to extract the coordinates, font size and
(hopefully) the width of each character. This allows phrases to be
generated. Creating sentences and paragraphs, lists and tables is hard and
discipline-dependent (think about hyphenation).

The positive side of doing this is that when you only have pixel
information (about half the diagrams we see) then you have to reconstruct
the characters by OCR. The result of this then merges with the
character-based approach.

BTW if anyone has a good pointer to an Open Pure Java OCR tool we'd be
delighted as I'm hacking my own (there are ancillary reasons for this).
Tesseract is not Pure Java, JavaOCR has become very complex and Lookup
doesn't seem to provide fonts. Currently we are hacking this from a few
high-quality sets of glyphs (such as Wikipedia entries). Maybe we should be
using the outline glyphs?




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: 2 questions

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

you could use PDFStreamEngine and override http://pdfbox.apache.org/docs/1.8.4/javadocs/org/apache/pdfbox/util/PDFStreamEngine.html#processTextPosition%28org.apache.pdfbox.util.TextPosition%29

This gives you the position of every character. You would then need to match these against the string pattern you are looking for, accumulating the positions as you go. After that you would have the area covered by the string, which you could use to e.g. overlay a button and/or link element.
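
To make the matching step concrete, here is a library-free sketch. It assumes the characters of one line have already been collected (for instance from the TextPosition objects passed to processTextPosition) into a hypothetical Glyph holder, one character per glyph, sorted by x. The class names and the pattern I-[A-Z]+-\d+ are assumptions, the latter guessed from the single example I-HOIST-042.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceLocator {
    /** Hypothetical holder for one character and its rectangle. */
    static final class Glyph {
        final char ch;
        final double x, y, width, height;

        Glyph(char ch, double x, double y, double width, double height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Matched text plus the rectangle covering all of its glyphs. */
    static final class Box {
        final String text;
        final double x, y, width, height;

        Box(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Finds every pattern match on the line and accumulates its area. */
    static List<Box> locate(List<Glyph> line, Pattern pattern) {
        // One char per glyph, so string indices equal glyph indices.
        StringBuilder text = new StringBuilder();
        for (Glyph g : line) {
            text.append(g.ch);
        }
        List<Box> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // ignore empty matches
            }
            double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
            double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
            for (int i = m.start(); i < m.end(); i++) {
                Glyph g = line.get(i);
                minX = Math.min(minX, g.x);
                minY = Math.min(minY, g.y);
                maxX = Math.max(maxX, g.x + g.width);
                maxY = Math.max(maxY, g.y + g.height);
            }
            boxes.add(new Box(m.group(), minX, minY, maxX - minX, maxY - minY));
        }
        return boxes;
    }
}
```

Since the references have no termination character, a real implementation would probably also stop a match at the first large horizontal gap, otherwise digits belonging to a neighbouring label could be swallowed by \d+.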

BR
Maruan Sahyoun



Re: 2 questions

Posted by Olaf Drümmer <ol...@callassoftware.com>.
You could use x and y position and rotation information to determine whether two given characters - given their size - are relatively close to each other or not and are on the same line. 

BT / ET is not at all guaranteed to give you strings as perceived by a human.
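
One possible way to fold rotation into that test is sketched below. It is a sketch under assumptions, not a PDFBox API: the Glyph holder is hypothetical, (x, y) is the glyph origin in a y-up coordinate space, rotation is counter-clockwise in degrees, and advance is the glyph's width along its baseline in page units. The offset between the two origins is projected onto the baseline direction and onto its normal.

```java
final class RotatedNeighbour {
    /** Hypothetical glyph holder; see the assumptions in the lead-in. */
    static final class Glyph {
        final double x, y, rotationDegrees, advance;

        Glyph(double x, double y, double rotationDegrees, double advance) {
            this.x = x;
            this.y = y;
            this.rotationDegrees = rotationDegrees;
            this.advance = advance;
        }
    }

    /**
     * True if b directly follows a on the same (possibly rotated) line.
     * The tolerance is in page units. Angles are compared naively, so
     * 359.9 and 0.0 degrees are treated as different rotations.
     */
    static boolean follows(Glyph a, Glyph b, double tolerance) {
        if (Math.abs(a.rotationDegrees - b.rotationDegrees) > 1e-3) {
            return false;
        }
        double theta = Math.toRadians(a.rotationDegrees);
        double dx = b.x - a.x;
        double dy = b.y - a.y;
        // Component of the offset along the baseline and across it.
        double along = dx * Math.cos(theta) + dy * Math.sin(theta);
        double across = -dx * Math.sin(theta) + dy * Math.cos(theta);
        return Math.abs(across) <= tolerance
                && along > 0
                && along <= a.advance + tolerance;
    }
}
```

Scaling the tolerance by the character size, as the message suggests, would make the check less sensitive to mixed font sizes.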

Olaf




Re: 2 questions

Posted by HQS <hq...@gmail.com>.
Well, thank you all for your quick responses.

The PDFs are generated by Autodesk Inventor (even the latest version produces that kind of output).

It is for one of my clients, who wants some specific strings in the PDF
to be automatically transformed into clickable links.

My problem is very simple: with such a structure I have no way to know where a string ends.

As a matter of fact, all the references to be transformed are prefixed
with ‘I-’ but there is no termination character, for instance: « I-HOIST-042 ».
Given that in the PDF I, -, H, O, (etc.), 2 are separate characters, I cannot rebuild the original string.

I was hoping for a single block of text (BT … ET) but, as I mentioned, each character is put in its own block...

Regards,




Re: 2 questions

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Julien,

for 1) that’s possible and supported - how was the document generated? DTP application?
for 2) PDFBox doesn’t enforce a PDF version. In general it supports all PDF files; it doesn’t fully cover every feature defined in each PDF version, but the coverage should be reasonable. There is no documentation on coverage yet, so I can’t guarantee that a specific feature is supported. Is there something special you are looking for?

BR
Maruan Sahyoun
