Posted to users@pdfbox.apache.org by "Hesham G." <he...@gmail.com> on 2015/04/21 21:33:00 UTC

Reading text using TextPosition

Hello,

When reading PDF text using TextPosition, is there a way to tell whether the current character is a newline character?

protected void processTextPosition( TextPosition text )  {
    System.out.println( text.getCharacter() );  // Prints space if this is a new line character in the PDF file.
}


Best regards ,
Hesham

Re: Reading text using TextPosition

Posted by John Hewson <jo...@jahewson.com>.

> On 21 Apr 2015, at 13:21, Hesham G. <he...@gmail.com> wrote:
> 
> Frank ,
> 
> Thanks for explaining this. 
> 
> What I am trying to do is reading sentences from the PDF using TextPosition. Your explanation is clear and I can detect the new line using X & Y, but what if a sentence is written on 2 lines ? ... Reading the Y-coordinate for the second line will result with dealing with it as a new sentence instead of considering it a completion for the first line of the sentence.

Could you just take the output of PDFToText as a text file and then run it through an NLP sentence segmenter? Or is there some special case you're trying to handle?
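
A minimal sketch of that idea, assuming the extracted text is already available as a String. It uses the JDK's java.text.BreakIterator as a simple stand-in for a full NLP sentence segmenter; the class and method names (SentenceSplitDemo, sentences) are made up for illustration and are not part of PDFBox.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {

    /** Joins hard-wrapped lines with single spaces, then splits the result into sentences. */
    static List<String> sentences(String extractedText) {
        // Collapse every run of whitespace (including line breaks) into one space,
        // so a sentence wrapped across two lines becomes contiguous again.
        String flat = extractedText.trim().replaceAll("\\s+", " ");

        // Rule-based sentence boundaries from the JDK; a real NLP segmenter
        // could be swapped in here.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(flat);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = flat.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This sentence is written\non two lines. Is this a second one?\nYes!";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

Because the line breaks are discarded before segmenting, it no longer matters where the PDF happened to wrap a sentence; the trade-off is that paragraph boundaries are lost too.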


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Reading text using TextPosition

Posted by Tilman Hausherr <TH...@t-online.de>.
On 21.04.2015 at 23:00, Hesham Gneady wrote:
> A sentence could also end with a question mark, exclamation mark, ... Etc.
> I think there will be many cases to handle.
>
> I also wonder .. When reading text from the book using PDFTextStripper it
> can read the new line characters, right ? TextPosition seems to be reading
> the pdf text in a different way.

PDFTextStripper constructs these newline characters from the y positions
of the text glyphs it outputs, not from characters that exist in the file.

PDF isn't something like HTML. It is a complex format for graphic
output. (I wish there was an English translation for "Eierlegende
Wollmilchsau".)

Tilman





Re: Reading text using TextPosition

Posted by Hesham Gneady <he...@gmail.com>.
A sentence could also end with a question mark, an exclamation mark, etc.
I think there will be many cases to handle.

I also wonder: when PDFTextStripper reads text from the book, it can read
the newline characters, right? TextPosition seems to read the PDF text in
a different way.
On Apr 21, 2015 10:40 PM, "Eric Douglas" <ed...@blockhouse.com> wrote:

> A proper sentence ends with a period, so text that is one character height
> below other text is assumed to be tacked onto the same sentence (with a
> space between).
> If you have the font, you know the font size, you should be able to
> calculate one character height.
> If sentences aren't ended with periods, text may be assumed to be a new
> sentence on a new line if it's more than a character height down.
>
> ie
> A sentence here
>
>
> Another sentence here

Re: Reading text using TextPosition

Posted by Eric Douglas <ed...@blockhouse.com>.
A proper sentence ends with a period, so text that sits one character height
below other text can be assumed to continue the same sentence (with a
space between).
If you have the font and know the font size, you should be able to
calculate one character height.
If sentences don't end with periods, text can be assumed to start a new
sentence on a new line if it is more than one character height further down.

e.g.
A sentence here


Another sentence here
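
The heuristic above could be sketched roughly as follows. This is only an illustration under stated assumptions: Line is a hypothetical stand-in for an already-assembled text line (it is not a PDFBox class), Y is assumed to grow down the page, gapFactor is a tunable guess at how many character heights count as "more than a character height down", and question and exclamation marks are treated as sentence endings alongside the period. It only decides whether a line break is also a sentence break; it does not split sentences that end mid-line.

```java
import java.util.ArrayList;
import java.util.List;

public class LineJoinDemo {

    /** Hypothetical stand-in for one extracted line: its text, Y position and character height. */
    static final class Line {
        final String text;
        final float y;       // assumed to grow down the page
        final float height;  // approximate character height, e.g. derived from the font size

        Line(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** Merges consecutive lines unless punctuation or a large vertical gap separates them. */
    static List<String> joinLines(List<Line> lines, float gapFactor) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            if (previous != null) {
                String prevText = previous.text.trim();
                boolean ended = prevText.endsWith(".") || prevText.endsWith("?") || prevText.endsWith("!");
                boolean farBelow = (line.y - previous.y) > gapFactor * previous.height;
                if (ended || farBelow) {
                    // Previous line closed a sentence, or this line is too far down to belong to it.
                    sentences.add(current.toString());
                    current.setLength(0);
                } else {
                    // Same sentence continues on the next line: join with a space.
                    current.append(' ');
                }
            }
            current.append(line.text.trim());
            previous = line;
        }
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("A sentence that wraps", 100f, 10f));
        lines.add(new Line("onto a second line.", 112f, 10f));
        lines.add(new Line("A sentence here", 124f, 10f));
        lines.add(new Line("Another sentence here", 160f, 10f));
        for (String s : joinLines(lines, 1.5f)) {
            System.out.println(s);
        }
    }
}
```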

On Tue, Apr 21, 2015 at 4:21 PM, Hesham G. <he...@gmail.com> wrote:

> Frank ,
>
> Thanks for explaining this.
>
> What I am trying to do is reading sentences from the PDF using
> TextPosition. Your explanation is clear and I can detect the new line using
> X & Y, but what if a sentence is written on 2 lines ? ... Reading the
> Y-coordinate for the second line will result with dealing with it as a new
> sentence instead of considering it a completion for the first line of the
> sentence.
>
>
> Best regards ,
> Hesham

Re: Reading text using TextPosition

Posted by "Hesham G." <he...@gmail.com>.
Frank,

Thanks for explaining this.

What I am trying to do is read sentences from the PDF using TextPosition. Your explanation is clear, and I can detect a new line using X and Y, but what if a sentence is written across two lines? Going by the Y coordinate, the second line would be treated as a new sentence instead of as the continuation of the sentence that began on the first line.


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

Hi Hesham,

There is no newline character in a PDF. Only printable characters are
saved, each with its X and Y coordinates.
If you sort the TextPositions by Y and X, you can detect 'newlines' by
finding an increase in Y and a decrease in X. However, this isn't
foolproof, since things like subscripts and superscripts are out of order
when sorted by Y. Where there are multiple columns, this won't work.

Frank



Re: Reading text using TextPosition

Posted by "Hesham G." <he...@gmail.com>.
The NLP sentence segmenter was really a helpful idea.
Thanks a lot John & Frank.


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

What have you got so far?  Can you provide sample code to work with?



Re: Reading text using TextPosition

Posted by Eric Douglas <ed...@blockhouse.com>.
What have you got so far?  Can you provide sample code to work with?

On Wed, Apr 22, 2015 at 12:02 PM, Hesham G. <he...@gmail.com> wrote:

> Frank ,
>
> I have handled TextPositions using X & Y coordinates as you have suggested
> to detect new lines. It works fine, but if a sentence is written on 2 lines
> I can't detect it. If you know a trick to detect that it will help a lot.
>
> Best regards ,
> Hesham

Re: Reading text using TextPosition

Posted by "Hesham G." <he...@gmail.com>.
Frank,

I have handled TextPositions using X and Y coordinates, as you suggested,
to detect new lines. It works fine, but I can't detect when a sentence is
written across two lines. If you know a trick to detect that, it would help a lot.

Best regards ,
Hesham

------------------------------------------------------------------------

Hi Hesham,

There is no newline character in a PDF. Only printable characters are
saved, each with its X and Y coordinates.
If you sort the TextPositions by Y and X, you can detect 'newlines' by
finding an increase in Y and a decrease in X. However, this isn't
foolproof, since things like subscripts and superscripts are out of order
when sorted by Y. Where there are multiple columns, this won't work.

Frank






Re: Reading text using TextPosition

Posted by Frank van der Hulst <dr...@gmail.com>.
Hi Hesham,

There is no newline character in a PDF. Only printable characters are
saved, each with its X and Y coordinates.
If you sort the TextPositions by Y and X, you can detect 'newlines' by
finding an increase in Y and a decrease in X. However, this isn't
foolproof, since things like subscripts and superscripts are out of order
when sorted by Y. Where there are multiple columns, this won't work.
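
A rough sketch of the sort-and-compare approach described above. Glyph is a hypothetical stand-in for PDFBox's TextPosition so that the example runs without the library, Y is assumed to grow down the page, and yTolerance is an assumed threshold for what counts as an increase in Y. It inherits the limitations already mentioned: subscripts, superscripts and multiple columns will confuse it, and a line consisting of a single glyph would not pass the X test.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NewlineFromPositionsDemo {

    /** Hypothetical stand-in for a TextPosition: one character and its page coordinates. */
    static final class Glyph {
        final String ch;
        final float x;
        final float y; // assumed to grow down the page

        Glyph(String ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts glyphs by Y, then X, and inserts '\n' where Y increases and X decreases. */
    static String toText(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                              .thenComparingDouble(g -> g.x));

        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : sorted) {
            if (previous != null
                    && (g.y - previous.y) > yTolerance  // moved down the page
                    && g.x < previous.x) {              // and jumped back to the left
                out.append('\n');
            }
            out.append(g.ch);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        // Deliberately out of order, as glyphs in a content stream can be.
        glyphs.add(new Glyph("o", 16f, 112f));
        glyphs.add(new Glyph("H", 10f, 100f));
        glyphs.add(new Glyph("y", 10f, 112f));
        glyphs.add(new Glyph("i", 16f, 100f));
        System.out.println(toText(glyphs, 2f));
    }
}
```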

Frank


On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <he...@gmail.com> wrote:

> Hello ,
>
> When reading PDF text using TextPosition, is there a way to know if the
> current character is a new line character ?
>
> protected void processTextPosition( TextPosition text )  {
>     System.out.println( text.getCharacter() );  // Prints space if this is
> a new line character in the PDF file.
> }
>
>
> Best regards ,
> Hesham