Posted to users@pdfbox.apache.org by Renaud Billen <re...@nic.be> on 2015/01/10 14:04:02 UTC

Content of pdf moved around

Hello,

I have a small problem with text extraction from some PDFs: some words swap places with others.

With the PDF attached to this mail, if I use "Save as Text" in Adobe Reader, I get:

Référence: LIX-673LIX-6737 


Nom: The test company 


Type: 
Ouverture: 24/04/2007 

Titulaire: BD 
Resp.: LIX 
Co-Resp.: BB 
Client 




But with PDFBox I get:

Référence: LIX-6737
Nom: The test company
Titulaire: BD
Resp.: LIX
Co-Resp.: BB
Type:
Ouverture: 24/04/2007
Client


Could you tell me if something can be done to solve this problem?

Thanks,
Renaud



Re: Content of pdf moved around - SOLVED

Posted by Renaud Billen <re...@nic.be>.
Wow, I thought the sort option would sort alphabetically, so I hadn't tried it, but it did the trick… :)

Anyway, thanks a lot for your help,
Renaud

> On 10 Jan 2015, at 14:20, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> 
> Hi,
> 
> On 10.01.2015 at 14:04, Renaud Billen wrote:
>> Could you tell me if something can be done to solve this problem?
> Is the sort option activated?
> 
>> Thanks,
>> Renaud
> 
> BR
> Andreas Lehmkühler


Re: Content of pdf moved around

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

On 10.01.2015 at 14:04, Renaud Billen wrote:
> Could you tell me if something can be done to solve this problem?
Is the sort option activated?

> Thanks,
> Renaud

BR
Andreas Lehmkühler


Re: Content of pdf moved around

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Ray,

To unsubscribe, you have to send an email to users-unsubscribe@pdfbox.apache.org.
See [1] for further details.

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/mailinglists.html

> On 10 January 2015 at 22:48, Ray Morris <ra...@bigpond.com> wrote:
> 
> 
> Please unsubscribe ray.morris.brisbane@bigpond.com
> 
> I briefly had the ambition to teach myself how to maintain bookmarks and XML 
> metadata for sheet music libraries but gave up that idea because of the 
> complexity of PDF files.
> 

Re: Content of pdf moved around

Posted by Ray Morris <ra...@bigpond.com>.
Please unsubscribe ray.morris.brisbane@bigpond.com

I briefly had the ambition to teach myself how to maintain bookmarks and XML 
metadata for sheet music libraries but gave up that idea because of the 
complexity of PDF files.


Re: Content of pdf moved around

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

The PDF didn't go through (never does), but you can try to use 
PDFTextStripper.setSortByPosition().

Tilman
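
For readers wondering why the sort option helps here, and why it is not an alphabetical sort: without sorting, a text extractor typically emits text in the order it occurs in the page's content stream, which need not match the visual layout. Sorting by position reorders the pieces top to bottom, then left to right. In PDFBox the switch is PDFTextStripper.setSortByPosition(true), called before getText(). The sketch below is not PDFBox code. It is a plain-Java illustration of the idea, and the class name, the Fragment type, and all coordinates are invented for the example; PDFBox's own ordering logic is more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** A piece of text plus the (hypothetical) page coordinates where it starts. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Fragments in the order a content stream might list them; coordinates are made up. */
    static List<Fragment> sampleFragments() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Nom:", 50, 100));
        fragments.add(new Fragment("Titulaire:", 50, 160));
        fragments.add(new Fragment("Resp.:", 50, 180));
        fragments.add(new Fragment("Co-Resp.:", 50, 200));
        fragments.add(new Fragment("Type:", 50, 120));
        fragments.add(new Fragment("Ouverture:", 50, 140));
        fragments.add(new Fragment("Client", 50, 220));
        return fragments;
    }

    /** Returns the fragment texts ordered top to bottom, then left to right. */
    static List<String> readingOrder(List<Fragment> fragments) {
        // Sort a copy so the caller's stream-order list stays untouched.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Print one label per line, in visual (position-sorted) order.
        for (String line : readingOrder(sampleFragments())) {
            System.out.println(line);
        }
    }
}
```

With these invented coordinates, the unsorted list mirrors the order PDFBox produced in the original report, and the position-sorted result puts "Type:" and "Ouverture:" back between "Nom:" and "Titulaire:".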