Posted to users@pdfbox.apache.org by "Hesham G." <he...@gmail.com> on 2016/03/17 07:12:05 UTC

Spaces are ignored when reading a PDF file

Hello,

I have a PDF file created using LaTeX. I am trying to read and print every character in that file using PDFBox, but when I do this, all spaces in the file are ignored. Here is the code I am using:
PDPage page = (PDPage)allPages.get( 0 );
PDStream contents = page.getContents();
if ( contents != null ) {
    PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
    pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
}

public class PDFTextStripperProcessor extends PDFTextStripper {
    @Override
    public void processTextPosition( TextPosition text )  {
        System.out.println( text.getCharacter() );
    }
}

You can test this with a one-page sample file:
https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf

What is the cause of this issue, please?


Best regards,
Hesham

Re: Spaces are ignored when reading a PDF file

Posted by "Hesham G." <he...@gmail.com>.

Clovis,

Thanks a lot :)

I will have to follow this solution if there is no alternative. The problem
is that if I am extracting text from a 500- or 600-page PDF, it will consume
a lot of additional memory and time. In addition, I guess this is a special
case that affects only LaTeX-generated books.

Best regards,
Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Spaces are ignored when reading a PDF file

Posted by clovis <cl...@gmail.com>.

Just an idea from someone who is fluent in neither PDFBox nor PDF:
if you only need to know that there is a space between the letters, and not
how many spaces, you can use your code to get the character details and
then use extractText to get the words.

Re: Spaces are ignored when reading a PDF file

Posted by "Hesham G." <he...@gmail.com>.

John,

I have checked the PrintTextLocations.java example and tested it on the words "With due" in my sample book, using this code:
System.out.println( "String[" + text.getCharacter() + ": " + text.getXDirAdj() + "," +
        text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
        text.getXScale() + " height=" + text.getHeightDir() + " space=" +
        text.getWidthOfSpace() + " width=" + text.getWidthDirAdj() + "]" );
And here are the results:
String[W: 102.88399,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=11.9552]
String[i: 114.18165,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.4789658]
String[t: 117.660614,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.8973923]
String[h: 121.55801,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=6.957924]
String[d: 133.09477,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.3046265]
String[u: 140.3994,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.2089844]
String[e: 147.60838,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=5.7265472]

So which method do you mean? getXDirAdj()?


Best regards,
Hesham
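
As a rough illustration of the gap test asked about in this thread, the sketch below applies it to the seven glyphs printed above. It is plain Java with no PDFBox dependency; the class name GapSpaceSketch, the join helper, and the 0.5 factor are assumptions made for this example, not something PDFTextStripper is known to use. It also assumes horizontal, left-to-right text on a single line in one font.

```java
final class GapSpaceSketch {
    /**
     * Joins glyphs and inserts a space wherever the distance from the previous
     * glyph's right edge to the next glyph's left edge exceeds
     * factor * spaceWidth. The factor is an arbitrary choice for this sketch.
     */
    static String join(String[] glyphs, double[] x, double[] width,
                       double spaceWidth, double factor) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                double gap = x[i] - (x[i - 1] + width[i - 1]);
                if (gap > factor * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(glyphs[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // x and width values copied from the output quoted in this message
        String[] glyphs = {"W", "i", "t", "h", "d", "u", "e"};
        double[] x = {102.88399, 114.18165, 117.660614, 121.55801,
                      133.09477, 140.3994, 147.60838};
        double[] width = {11.9552, 3.4789658, 3.8973923, 6.957924,
                          7.3046265, 7.2089844, 5.7265472};
        System.out.println(join(glyphs, x, width, 2.9888, 0.5)); // prints: With due
    }
}
```

With these numbers only the gap between "h" and "d" (about 4.58 units, against a space width of 2.9888) crosses the threshold. The gap between "W" and "i" is slightly negative, which is consistent with kerning.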

------------------------------------------------------------------------
Included message :

I’m rather confused by this thread: inferring spaces is one of the main features of PDFTextStripper. I’m not sure why anyone is suggesting processing the text manually; there’s no need to do that. We do that already!

Looking at the original code the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>    @Override
>    public void processTextPosition( TextPosition text )  {
>        System.out.println( text.getCharacter() );
>    }
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* to PDFTextStripper, but this override prevents that from happening and just prints the unprocessed token before PDFTextStripper has had a chance to do its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example which shows you how to get the processed TextPositions from PDFTextStripper. It’s really easy to do.

— John
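
The effect John describes can be reproduced without PDFBox. In the sketch below every class is a hypothetical stand-in (BaseStripper is not PDFTextStripper, and processGlyph is not a PDFBox method): the base class infers spaces inside its per-glyph hook, one subclass overrides the hook without calling super, and another forwards to super.

```java
/** Hypothetical stand-in for a text stripper; not the PDFBox class. */
class BaseStripper {
    private final StringBuilder text = new StringBuilder();
    private double lastRight = Double.NaN;

    /** Per-glyph hook: the base version is where spaces are inferred. */
    protected void processGlyph(String glyph, double x, double width) {
        if (!Double.isNaN(lastRight) && x - lastRight > 1.0) {
            text.append(' ');              // gap wider than 1 unit: treat as a space
        }
        text.append(glyph);
        lastRight = x + width;
    }

    String getText() {
        return text.toString();
    }
}

/** Like the override in the question: the base class never sees the glyph. */
class SwallowingStripper extends BaseStripper {
    @Override
    protected void processGlyph(String glyph, double x, double width) {
        // raw glyph handled here; no call to super
    }
}

/** Forwards each glyph so the base class can still do its job. */
class ForwardingStripper extends BaseStripper {
    @Override
    protected void processGlyph(String glyph, double x, double width) {
        super.processGlyph(glyph, x, width);
    }
}

final class OverrideDemo {
    /** Feeds four made-up glyphs with a gap between the second and third. */
    static String run(BaseStripper s) {
        s.processGlyph("h", 0.0, 5.0);
        s.processGlyph("i", 5.0, 2.0);
        s.processGlyph("y", 10.0, 5.0);
        s.processGlyph("o", 15.0, 5.0);
        return s.getText();
    }

    public static void main(String[] args) {
        System.out.println("[" + run(new SwallowingStripper()) + "]"); // prints: []
        System.out.println("[" + run(new ForwardingStripper()) + "]"); // prints: [hi yo]
    }
}
```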

> On 17 Mar 2016, at 04:44, Hesham G. <he...@gmail.com> wrote:
> 
> Andreas,
> 
> You're absolutely right. I am testing it now, but it seems very complicated. I hope there is an easier solution.
> 
> 
> Best regards,
> Hesham
> 
> ------------------------------------------------------------------------
> Included message :
> 
>> "Hesham G." <he...@gmail.com> wrote on 17 March 2016 at 11:20:
>> 
>> 
>> Andreas,
>> 
>> That is very helpful.
>> 
>> I can get the x location of each character using TextPosition.getX(), e.g.:
>> W: 102.88399
>> i: 114.18165
>> t: 117.660614
>> h: 121.55801
>> d: 133.09477
>> u: 140.3994
>> e: 147.60838
>> 
>> So to detect the space between the two words "With" and "due", should I
>> subtract the X of the last letter (h) from the X of the first letter (d),
>> and treat the gap as a space if it is larger than normal? I think this
>> approach to detection might be risky, or is it?
> That's the short story. Deciding what is normal can be quite tricky. You have
> to take the following facts into account:
> 
> - different fonts have different widths (important if the font before the space
> isn't the same as the font after the space)
> - scaling, and sometimes rotation, has to be taken into account
> - the "space" between characters may vary if the text is justified
> 
> There are certainly other details which may be important as well, so
> you end up with a more or less heuristic approach.
> 
> BR
> Andreas
> 
>> Best regards ,
>> Hesham
>> 
>> ------------------------------------------------------------------------
>> Included message :
>> 
>> Hi,
>> 
>> > Frank van der Hulst <dr...@gmail.com> wrote on 17 March 2016 at 08:34:
>> >
>> >
>> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> > compare the X coordinates of adjacent characters against their widths.
>> That's not correct. Spaces exist, but in most cases PDF engines omit them and
>> replace them with split text and appropriate positioning.
>> 
>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>> 
>>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
>>     -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
>>     (the) -383 (right) ] TJ
>> 
>> The text is in between the parentheses, and the numbers are used for horizontal
>> positioning.
>> 
>> BR
>> Andreas
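
To make the TJ excerpt above concrete, here is a small plain-Java sketch that walks such an array, copies the string operands, and inserts a space wherever a numeric adjustment is more negative than a cutoff. In a TJ array the numbers are in thousandths of a text-space unit and are subtracted from the current position, so -383 at the 11.9552 font size reported in this thread moves the next glyph right by roughly 4.58 units. The names TjArraySketch and decode and the cutoff of 200 are assumptions for illustration. The sketch only understands the escapes \(, \) and \\, and it treats string bytes as plain characters, which a real font encoding may not allow.

```java
final class TjArraySketch {
    /**
     * Copies the string operands of a TJ array and adds a space wherever a
     * numeric adjustment is at or below -cutoff (thousandths of a text unit).
     */
    static String decode(String tjArray, double cutoff) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = tjArray.length();
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                i++;                                   // skip the opening parenthesis
                int depth = 1;
                while (i < n && depth > 0) {
                    char d = tjArray.charAt(i);
                    if (d == '\\' && i + 1 < n) {
                        out.append(tjArray.charAt(i + 1)); // simplified escape: keep next char
                        i += 2;
                        continue;
                    }
                    if (d == '(') {
                        depth++;
                    } else if (d == ')') {
                        depth--;
                        if (depth == 0) {
                            i++;                       // skip the closing parenthesis
                            break;
                        }
                    }
                    out.append(d);
                    i++;
                }
            } else if (c == '-' || Character.isDigit(c)) {
                int start = i++;
                while (i < n && (Character.isDigit(tjArray.charAt(i))
                        || tjArray.charAt(i) == '.')) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment <= -cutoff) {
                    out.append(' ');                   // large rightward shift: word gap
                }
            } else {
                i++;                                   // brackets, whitespace, operator name
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        // prints: With due regard to Article (219), the competent authority has the right
        System.out.println(decode(excerpt, 200));
    }
}
```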

Re: NullPointerException in multithreading

Posted by Tilman Hausherr <TH...@t-online.de>.

Sorry, I meant that this line

ICC_Profile profile = ICC_Profile.getInstance(input);

should be enclosed in the "synchronized" block.

Tilman
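
A minimal sketch of that idea, assuming nothing about the real PDICCBased source: a helper funnels every ICC_Profile.getInstance call through one shared lock, so that only one thread at a time loads a profile. The class name SerializedIccLoader and the LOCK field are invented for this example (the suggestion in this thread synchronizes on the class's LOG object instead), and the sketch does not show whether this is sufficient on any particular JDK.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class SerializedIccLoader {
    // Hypothetical shared lock; any object shared by all callers would do.
    private static final Object LOCK = new Object();

    /** Loads an ICC profile while holding LOCK, so loads never overlap. */
    static ICC_Profile load(InputStream input) throws IOException {
        synchronized (LOCK) {
            return ICC_Profile.getInstance(input);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Use the bytes of the built-in sRGB profile as sample input.
        byte[] srgb = ICC_Profile.getInstance(ColorSpace.CS_sRGB).getData();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                try {
                    ICC_Profile p = load(new ByteArrayInputStream(srgb));
                    System.out.println("components: " + p.getNumComponents());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();                      // wait for all loads to finish
        }
    }
}
```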

NullPointerException in multithreading

Posted by Tilman Hausherr <TH...@t-online.de>.

Hello 风云天空,

This is obviously not related to "Spaces are ignored when reading a PDF 
file" so you should have created a new subject line instead of hijacking 
an existing thread by pressing "reply".

I did have the same problem while working on
https://issues.apache.org/jira/browse/PDFBOX-3267

What I did was to change the source code of PDICCBased.java, i.e. change 
this line

awtColorSpace = (ICC_ColorSpace)ColorSpace.getInstance(ColorSpace.CS_sRGB);


to

synchronized(LOG)
{
    awtColorSpace = (ICC_ColorSpace)ColorSpace.getInstance(ColorSpace.CS_sRGB);
}


This is a Java bug. I'm undecided whether the change above should be 
committed. But try the change :-)

Tilman



On 18.03.2016 at 12:02, 风云天空 wrote:
> Who can help me?
> I get this error in multithreading:
> java.lang.NullPointerException
> 	at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
> 	at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
> 	at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
> 	at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
> 	at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
> 	at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
> 	at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
> 	at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
> 	at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
> 	at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
> 	at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
> 	at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
> 	at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
> 	at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
> 	at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> java.util.ConcurrentModificationException
> 	at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
> 	at java.util.Vector$Itr.next(Vector.java:1133)
>


Re: Spaces are ignored when reading a PDF file

Posted by John Hewson <jo...@jahewson.com>.

The subject of this thread is "Spaces are ignored when reading a PDF file". Please post new questions in a new thread.

— John

> On 18 Mar 2016, at 04:02, 风云天空 <10...@qq.com> wrote:
> 
> who can help me 
> i get this error in multithreading
> 
> ------------------ Original message ------------------
> From: "Hesham G." <he...@gmail.com>
> Sent: Friday, 18 March 2016, 4:44 PM
> To: "users" <us...@pdfbox.apache.org>
> 
> Subject: Re: Spaces are ignored when reading a PDF file
> 
> 
> 
> John,
> 
> I think I have got the idea ... Thumbs up
> 
> 
> Best regards,
> Hesham 
> 

Re: Spaces are ignored when reading a PDF file

Posted by 风云天空 <10...@qq.com>.

Can anyone help me?
I get this error when running in multiple threads:
java.lang.NullPointerException
	at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
	at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
	at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
	at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
	at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
	at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
	at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
	at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
	at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
	at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
	at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
	at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
	at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
	at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
	at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
	at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
	at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
java.util.ConcurrentModificationException
	at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
	at java.util.Vector$Itr.next(Vector.java:1133)



------------------ Original message ------------------
From: "Hesham G." <he...@gmail.com>
Sent: Friday, March 18, 2016, 4:44 PM
To: "users" <us...@pdfbox.apache.org>

Subject: Re: Spaces are ignored when reading a PDF file



John,

I think I have got the idea ... Thumbs up


Best regards ,
Hesham


Re: Spaces are ignored when reading a PDF file

Posted by "Hesham G." <he...@gmail.com>.

John,

I think I have got the idea ... Thumbs up


Best regards ,
Hesham 


Re: Spaces are ignored when reading a PDF file

Posted by John Hewson <jo...@jahewson.com>.

I’m rather confused by this thread: inferring spaces is one of the main features of PDFTextStripper. I’m not sure why anyone is suggesting processing the text manually - there’s no need to do that. We do that already!

Looking at the original code the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>    @Override
>    public void processTextPosition( TextPosition text )  {
>        System.out.println( text.getCharacter() );
>    }
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* to PDFTextStripper, but this override prevents that from happening: it just prints the unprocessed token before PDFTextStripper has had a chance to do its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example which shows you how to get the processed TextPositions from PDFTextStripper. It’s really easy to do.
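
The difference between overriding the input side and the output side can be sketched with a toy model. This is deliberately not PDFBox code: `ToyStripper`, `Glyph`, `processGlyph`, the sample glyphs, and the gap threshold of 1.0 are all made up for illustration. The only point it shows is that an override of the input hook that does not call the base class starves the space inference, while an override of the output hook receives already-processed text.

```java
import java.util.ArrayList;
import java.util.List;

public class OverrideSketch {
    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Hypothetical stripper: the input hook collects glyphs, the output hook gets processed text. */
    static class ToyStripper {
        private final List<Glyph> collected = new ArrayList<>();
        protected final StringBuilder output = new StringBuilder();

        // Input hook: called once per raw glyph, before any processing.
        protected void processGlyph(Glyph g) {
            collected.add(g);
        }

        // Output hook: called with the text after spaces have been inferred.
        protected void writeString(String processed) {
            output.append(processed);
        }

        final String run(List<Glyph> page) {
            for (Glyph g : page) {
                processGlyph(g);
            }
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : collected) {
                // Insert a space when the gap after the previous glyph is large (toy threshold).
                if (prev != null && g.x - (prev.x + prev.width) > 1.0) {
                    sb.append(' ');
                }
                sb.append(g.text);
                prev = g;
            }
            writeString(sb.toString());
            return output.toString();
        }
    }

    /** Mirrors the mistake: the input hook is overridden and never reaches the base class. */
    static class BrokenStripper extends ToyStripper {
        @Override
        protected void processGlyph(Glyph g) {
            output.append(g.text); // raw glyph, no spaces inferred
        }
    }

    /** Hooks the output side instead, so it sees the processed text. */
    static class FixedStripper extends ToyStripper {
        @Override
        protected void writeString(String processed) {
            output.append('[').append(processed).append(']');
        }
    }

    static List<Glyph> samplePage() {
        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("h", 0.0, 5.0));
        page.add(new Glyph("i", 5.0, 3.0));
        page.add(new Glyph("y", 12.0, 5.0)); // starts 4 units after "i" ends
        page.add(new Glyph("o", 17.0, 5.0));
        return page;
    }

    public static void main(String[] args) {
        System.out.println(new BrokenStripper().run(samplePage())); // hiyo
        System.out.println(new FixedStripper().run(samplePage()));  // [hi yo]
    }
}
```

In the PDFBox versions I am aware of, the output-side hook on PDFTextStripper is `writeString(String, List<TextPosition>)`, but please check the PrintTextLocations.java that ships with your version rather than relying on this sketch.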

— John


Re: Spaces are ignored when reading a PDF file

Posted by "Hesham G." <he...@gmail.com>.

Andreas,

You're absolutely right. I am testing it now, but it seems very complicated. 
I hope there is an easier solution.


Best regards ,
Hesham


Re: Spaces are ignored when reading a PDF file

Posted by Andreas Lehmkühler <an...@lehmi.de>.

> "Hesham G." <he...@gmail.com> wrote on 17 March 2016 at 11:20:
> 
> 
> Andreas,
> 
> That is very helpful.
> 
> I can get the x location of each character using TextPosition.getX(), ex:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
> 
> So to detect the space between the two words "With" and "due", should I 
> subtract the X of the last letter (h) from the X of the first letter (d), 
> and treat it as a space if the result is larger than normal? I think 
> detecting spaces this way might be risky, or what?
That's the short story. Deciding what is normal can be quite tricky. You have
to take the following facts into account:

- different fonts have different widths (important if the font before the space
isn't the same as the font after the space)
- keep in mind that you have to take scaling, and sometimes rotation, into
account
- the "space" between characters may vary if the text is justified

There are certainly some other details which may be important as well, so
you end up with a more or less heuristic approach.
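
As a rough sketch of such a heuristic, the gap can be measured from the end of the previous glyph (its X plus its width) rather than from its start. The X values below are the ones quoted above; the per-glyph widths, the space width of 3.0 and the 0.5 fraction are made-up illustrative numbers, not values read from the PDF. The sketch assumes a single font, no rotation and no scaling, so it ignores exactly the complications listed above.

```java
public class GapHeuristicSketch {
    /**
     * Joins glyphs into text, inserting a space whenever the gap between the
     * end of the previous glyph and the start of the next one exceeds
     * fraction * spaceWidth.
     */
    static String inferSpaces(String[] glyphs, double[] x, double[] widths,
                              double spaceWidth, double fraction) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                double gap = x[i] - (x[i - 1] + widths[i - 1]);
                if (gap > fraction * spaceWidth) {
                    sb.append(' ');
                }
            }
            sb.append(glyphs[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] glyphs = {"W", "i", "t", "h", "d", "u", "e"};
        // X positions as reported in the thread.
        double[] x = {102.88399, 114.18165, 117.660614, 121.55801,
                      133.09477, 140.3994, 147.60838};
        // Hypothetical glyph widths, chosen only for illustration.
        double[] widths = {11.99, 3.48, 3.90, 6.75, 7.30, 7.21, 5.50};
        System.out.println(inferSpaces(glyphs, x, widths, 3.0, 0.5)); // With due
    }
}
```

With these assumed widths only the gap after "h" (about 4.8) exceeds the threshold of 1.5; the other gaps are close to zero or slightly negative because of kerning.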

BR
Andreas


Re: Spaces are ignored when reading a PDF file

Posted by Tilman Hausherr <TH...@t-online.de>.

On 17.03.2016 at 11:20, Hesham G. wrote:
>
> So to detect the space between the two words "With" and "due", should I 
> subtract the X of the last letter (h) from the X of the first letter (d), 
> and treat it as a space if the result is larger than normal? I think 
> detecting spaces this way might be risky, or what?

What you're doing is reinventing the PDFTextStripper code, which has 
some strategies for deciding where the spaces are. That's not a bad idea 
(there are some weaknesses), however it is indeed... "tricky".

https://www.youtube.com/watch?v=cjEdxO91RWQ&feature=youtu.be&t=3m33s




Re: Spaces are ignored when reading a PDF file

Posted by "Hesham G." <he...@gmail.com>.

Andreas,

That is very helpful.

I can get the x location of each character using TextPosition.getX(), ex:
W: 102.88399
i: 114.18165
t: 117.660614
h: 121.55801
d: 133.09477
u: 140.3994
e: 147.60838

So to detect the space between the two words "With" and "due", should I 
subtract the X of the last letter (h) from the X of the first letter (d), 
and treat it as a space if the result is larger than normal? I think 
detecting spaces this way might be risky, or what?
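
A quick sketch of that subtraction, using only the X values listed above (the class and method names are just for this example):

```java
import java.util.Locale;

public class XDeltaSketch {
    /** Returns the difference between each X position and the one before it. */
    static double[] deltas(double[] x) {
        double[] d = new double[x.length - 1];
        for (int i = 1; i < x.length; i++) {
            d[i - 1] = x[i] - x[i - 1];
        }
        return d;
    }

    public static void main(String[] args) {
        String letters = "Withdue";
        double[] x = {102.88399, 114.18165, 117.660614, 121.55801,
                      133.09477, 140.3994, 147.60838};
        double[] d = deltas(x);
        for (int i = 0; i < d.length; i++) {
            System.out.printf(Locale.ROOT, "%c -> %c: %.3f%n",
                    letters.charAt(i), letters.charAt(i + 1), d[i]);
        }
    }
}
```

The step from "W" to "i" comes out at about 11.30 and the step from "h" to "d" at about 11.54, so the raw differences alone do not separate a wide letter from a real word gap. That is why I think it is risky.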


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

Hi,

> Frank van der Hulst <dr...@gmail.com> hat am 17. März 2016 um 
> 08:34
> geschrieben:
>
>
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.
That's not correct, spaces exist but in most cases pdf engines omit them and
replace spaces by a splitted text with an appropriate positioning.

BTW, latex uses the same strategy. Here is a excerpt from your pdf:

   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 
(Article)
-384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
(the) -383 (right) ] TJ

The text is in between the braces and the numbers are used for horizontal
positioning.

BR
Andreas

>
> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <he...@gmail.com> wrote:
>
> > Hello ,
> >
> > I have a PDF file created using Latex. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in 
> > that
> > file are ignored. Here is the code I am using:
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> >     PDFTextStripperProcessor pdfTextStripperProcessor = new
> > PDFTextStripperProcessor();
> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
> > contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> >     @Override
> >     public void processTextPosition( TextPosition text )  {
> >         System.out.println( text.getCharacter() );
> >     }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards ,
> > Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org



Re: Spaces are ignored when reading a PDF file

Posted by Andreas Lehmkühler <an...@lehmi.de>.

Hi,

> Frank van der Hulst <dr...@gmail.com> wrote on 17 March 2016 at 08:34:
> 
> 
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.
That's not correct: spaces do exist, but in most cases PDF engines omit them and
instead split the text into pieces that are positioned appropriately.

BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:

   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
-384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
(the) -383 (right) ] TJ

The text is between the parentheses, and the numbers adjust the horizontal
position of what follows (in thousandths of a text space unit; a negative
value widens the gap).
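
To illustrate the idea, here is a minimal Java sketch (not PDFBox code) that
walks a TJ array like the one quoted here and emits a space whenever the
adjustment is strongly negative. The class name TjArrayText and the -250
cut-off are assumptions made for this example only. The sketch also assumes a
well-formed array whose string bytes can be read directly as characters, which
holds for this excerpt but not for every font encoding.

```java
final class TjArrayText {
    /** Assumed cut-off: adjustments at or below this value count as a word gap. */
    static final double WORD_GAP = -250.0;

    /** Rebuilds the visible text of one TJ array, inserting spaces at large gaps. */
    static String decode(String tj, double wordGap) {
        StringBuilder out = new StringBuilder();
        int i = tj.indexOf('[') + 1;
        while (i < tj.length() && tj.charAt(i) != ']') {
            char c = tj.charAt(i);
            if (c == '(') {
                i = readString(tj, i + 1, out);          // literal string: the text itself
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                int start = i++;
                while (i < tj.length()
                        && (tj.charAt(i) == '.' || Character.isDigit(tj.charAt(i)))) {
                    i++;
                }
                double adjustment = Double.parseDouble(tj.substring(start, i));
                if (adjustment <= wordGap) {
                    out.append(' ');                     // big shift to the right: treat as a space
                }
            } else {
                i++;                                     // whitespace between array elements
            }
        }
        return out.toString();
    }

    /** Copies a literal string that starts after '(' and returns the index after its ')'. */
    private static int readString(String s, int i, StringBuilder out) {
        int depth = 1;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                out.append(s.charAt(i++));               // simplification: handles \( \) \\ only
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    return i;
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return i;
    }

    public static void main(String[] args) {
        String tj = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        System.out.println(decode(tj, WORD_GAP));
    }
}
```

Run on the excerpt, this prints "With due regard to Article (219), the
competent authority has the right". Small values such as 55 or 18 are ordinary
kerning and are ignored.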

BR
Andreas

> 
> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <he...@gmail.com> wrote:
> 
> > Hello ,
> >
> > I have a PDF file created using Latex. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in that
> > file are ignored. Here is the code I am using:
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> >     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
> >     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> >     @Override
> >     public void processTextPosition( TextPosition text )  {
> >         System.out.println( text.getCharacter() );
> >     }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards ,
> > Hesham


Re: Spaces are ignored when reading a PDF file

Posted by Frank van der Hulst <dr...@gmail.com>.

Spaces don't exist as characters in PDFs. To identify spaces, you have to
compare the X coordinates of adjacent characters against their widths.
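
A rough sketch of that comparison in Java, assuming you have already collected
each glyph's text, x position, width and the font's space width (in PDFBox,
TextPosition exposes getX(), getWidth() and getWidthOfSpace() for this). The
Glyph class, the join method and the 0.5 factor are hypothetical names and
values chosen for the example, and the sketch only handles a single
left-to-right line.

```java
import java.util.Arrays;
import java.util.List;

final class GapSpaceDetector {
    /** Hypothetical stand-in for the per-glyph data a text extractor reports. */
    static final class Glyph {
        final String text;
        final float x;          // left edge of the glyph
        final float width;      // advance width of the glyph
        final float spaceWidth; // width of a space in the glyph's font

        Glyph(String text, float x, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Assumed tolerance: a gap wider than half a space counts as a word break. */
    static final float GAP_FRACTION = 0.5f;

    /** Joins glyphs of one line, inserting a space wherever the horizontal gap is large. */
    static String join(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                // Distance between the end of the previous glyph and the start of this one.
                float gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FRACTION * prev.spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("t", 0f, 3f, 4f),
                new Glyph("o", 3f, 5f, 4f),
                new Glyph("b", 12f, 5f, 4f),   // starts 4 units after "o" ends
                new Glyph("e", 17f, 4f, 4f));
        System.out.println(join(line));
    }
}
```

The right tolerance depends on the document; tightly set text may need a
smaller factor, and text spread over several lines also needs a check on the
y coordinate.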

On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <he...@gmail.com> wrote:

> Hello ,
>
> I have a PDF file created using Latex. I am trying to read and print all
> letters in that file using PDFBox, but when doing this all spaces in that
> file are ignored. Here is the code I am using:
> PDPage page = (PDPage)allPages.get( 0 );
> PDStream contents = page.getContents();
> if ( contents != null ) {
>     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
> }
>
> public class PDFTextStripperProcessor extends PDFTextStripper {
>     @Override
>     public void processTextPosition( TextPosition text )  {
>         System.out.println( text.getCharacter() );
>     }
> }
>
> And you can check a one page file sample here to test it:
>
> https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>
> What is the cause of this issue please?
>
>
> Best regards ,
> Hesham

Re: Spaces are ignored when reading a PDF file

Posted by "Hesham G." <he...@gmail.com>.

Tilman,

I am using this code to extract the text because I need font information
about each extracted character, such as the name of the font used. The normal
extraction code will not work in my case.


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

On 17.03.2016 at 07:12, Hesham G. wrote:
> Hello ,
>
> I have a PDF file created using Latex. I am trying to read and print all 
> letters in that file using PDFBox, but when doing this all spaces in that 
> file are ignored.

Here's what I get with ExtractText (your code is... unusual); this
looks excellent to me:

article titles c©by Michael O’Kane are not part of the law mu7ami.com
Article [220] Right to Regulate
With due regard to Article (219), the competent authority has the right
of monitoring the companies with regard to application of the provisions
set forth in the law and the company’s articles of association and bylaw
including the authority to inspect the company and check its account and
ask for data from the board of directors or the company managers through
a representative or more of its personnel or experts it chooses for this
pur-
pose.
Article [221] Access to Records
All the company officials shall acquaint the Ministry representatives and
the Authority, fi the company is listed in the financial market or seeking to
be listed, with regard to the works stated in Article (220), all that they ask
of company books and records and documents and provide them with all
related information or clarification.
94 version 0.2 provided by mu7ami.com



Re: Spaces are ignored when reading a PDF file

Posted by Tilman Hausherr <TH...@t-online.de>.

On 17.03.2016 at 07:12, Hesham G. wrote:
> Hello ,
>
> I have a PDF file created using Latex. I am trying to read and print all letters in that file using PDFBox, but when doing this all spaces in that file are ignored.

Here's what I get with ExtractText (your code is... unusual); this
looks excellent to me:

article titles c©by Michael O’Kane are not part of the law mu7ami.com
Article [220] Right to Regulate
With due regard to Article (219), the competent authority has the right
of monitoring the companies with regard to application of the provisions
set forth in the law and the company’s articles of association and bylaw
including the authority to inspect the company and check its account and
ask for data from the board of directors or the company managers through
a representative or more of its personnel or experts it chooses for this 
pur-
pose.
Article [221] Access to Records
All the company officials shall acquaint the Ministry representatives and
the Authority, fi the company is listed in the financial market or seeking to
be listed, with regard to the works stated in Article (220), all that they ask
of company books and records and documents and provide them with all
related information or clarification.
94 version 0.2 provided by mu7ami.com

