Posted to users@pdfbox.apache.org by Renaud Billen <re...@nic.be> on 2015/01/10 14:04:02 UTC
Content of pdf moved around
Hello,
I have a small issue with text extraction from some PDFs: some words come out in a different order than expected.
With the PDF attached to this mail, if I use "Save as Text" from Adobe Reader, I get:
Référence: LIX-673LIX-6737
Nom: The test company
Type:
Ouverture: 24/04/2007
Titulaire: BD
Resp.: LIX
Co-Resp.: BB
Client
But with PDFBox I get:
Référence: LIX-6737
Nom: The test company
Titulaire: BD
Resp.: LIX
Co-Resp.: BB
Type:
Ouverture: 24/04/2007
Client
Could you tell me if something can be done to solve this problem?
Thanks,
Renaud
Re: Content of pdf moved around - SOLVED
Posted by Renaud Billen <re...@nic.be>.
Wow, I thought the sort option would sort the text alphabetically, so I hadn't tried it, but it did the trick… :)
Anyway, thanks a lot for your help,
Renaud
> On 10 Jan 2015 at 14:20, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>
> Hi,
>
> On 10.01.2015 at 14:04, Renaud Billen wrote:
>> Could you tell me if something can be done to solve this problem?
> Is the sort option activated?
>
>> Thanks,
>> Renaud
>
> BR
> Andreas Lehmkühler
Re: Content of pdf moved around
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
On 10.01.2015 at 14:04, Renaud Billen wrote:
> Could you tell me if something can be done to solve this problem?
Is the sort option activated?
> Thanks,
> Renaud
BR
Andreas Lehmkühler
Re: Content of pdf moved around
Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Ray,
to unsubscribe you have to send an email to users-unsubscribe@pdfbox.apache.org.
See [1] for further details.
BR
Andreas Lehmkühler
[1] http://pdfbox.apache.org/mailinglists.html
> On 10 January 2015 at 22:48, Ray Morris <ra...@bigpond.com> wrote:
>
>
> Please unsubscribe ray.morris.brisbane@bigpond.com
>
> I briefly had the ambition to teach myself how to maintain bookmarks and XML
> metadata for sheet music libraries but gave up that idea because of the
> complexity of PDF files.
Re: Content of pdf moved around
Posted by Ray Morris <ra...@bigpond.com>.
Please unsubscribe ray.morris.brisbane@bigpond.com
I briefly had the ambition to teach myself how to maintain bookmarks and XML
metadata for sheet music libraries but gave up that idea because of the
complexity of PDF files.
Re: Content of pdf moved around
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
The PDF didn't go through (never does), but you can try calling
PDFTextStripper.setSortByPosition(true) before extracting the text.
Tilman
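As the thread notes, the sort option orders text by where it sits on the page, not alphabetically. The sketch below illustrates that idea with plain Java only: it is not PDFBox's actual implementation, and the Fragment type, the coordinates, and the "y grows downward" convention are all assumptions made up for the example. The fragments are listed in the order PDFBox returned them in the original report, and sorting by position restores the order seen in Adobe Reader.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration of "sort by position"; NOT PDFBox's real comparator.
// With PDFBox itself the switch is enabled like this (not compiled here):
//   PDFTextStripper stripper = new PDFTextStripper();
//   stripper.setSortByPosition(true);
//   String text = stripper.getText(document);
final class PositionSortSketch {

    /** Hypothetical text fragment with the page coordinates of its start. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // assumption: y grows from the top of the page downward

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts ordered top to bottom, then left to right. */
    static List<String> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fragments in content-stream order; the coordinates are invented.
        List<Fragment> streamOrder = new ArrayList<>();
        streamOrder.add(new Fragment("Nom:", 0, 20));
        streamOrder.add(new Fragment("Titulaire:", 0, 50));
        streamOrder.add(new Fragment("Resp.:", 0, 60));
        streamOrder.add(new Fragment("Co-Resp.:", 0, 70));
        streamOrder.add(new Fragment("Type:", 0, 30));
        streamOrder.add(new Fragment("Ouverture:", 0, 40));
        streamOrder.add(new Fragment("Client", 0, 80));

        for (String line : sortByPosition(streamOrder)) {
            System.out.println(line);
        }
    }
}
```

Without sorting, a text stripper typically emits text in the order it appears in the page's content stream, which need not match the visual layout. That would explain why "Type:" and "Ouverture:" came out after "Co-Resp.:" in the report above.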