Posted to users@pdfbox.apache.org by "Romano, Adrian" <Ad...@ngc.com> on 2008/12/18 21:52:36 UTC

Encoding issues

I have a few Russian PDFs that are exhibiting strange behavior when being extracted with PDFTextStripper. I am attaching my PDF, but I'm not sure if that is the correct thing to do. When I extract the PDF on Windows using UTF-8 encoding, the output is garbage. When I extract the PDF on Windows without specifying an encoding, the output is correct when viewed with UltraEdit. When I extract the PDF on Linux using any encoding, the output is garbage.
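For reference, a minimal Java sketch of why "no encoding specified" can behave differently on Windows and Linux: a writer created without a charset uses the JVM default charset (Charset.defaultCharset()), which has historically followed the operating system locale. The class name ExplicitOutputEncoding and the encode helper are hypothetical and are not part of PDFBox; the sketch uses only the standard library and assumes nothing about how PDFTextStripper itself chooses an encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitOutputEncoding {

    // Encodes text through a Writer with an explicitly chosen charset,
    // so the result does not depend on the platform default.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // A six-letter Russian sample word, written with escapes so the
        // source file encoding does not matter.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // This value is what an "encoding not specified" writer would use.
        System.out.println("platform default: " + Charset.defaultCharset());

        // UTF-8 needs two bytes per Cyrillic letter.
        System.out.println("UTF-8 bytes: " + encode(sample, StandardCharsets.UTF_8).length);

        // ISO-8859-1 cannot represent Cyrillic; each letter becomes '?'.
        System.out.println("ISO-8859-1 bytes: " + encode(sample, StandardCharsets.ISO_8859_1).length);
    }
}
```

Printing the default charset on both machines is a cheap first check before suspecting the PDF itself.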
It appears to me that the encoding isn't being read correctly from the PDF, and when it is output as UTF-8, it is being double encoded. I can detect this double encoding, rerun the extraction with no encoding specified, and then convert the result to UTF-8 using iconv, and it is OK. But this method does not work on Linux, as I cannot get the file to extract correctly using any encoding there.
Has anyone come across anything like this before, and if so, what can be done to solve it? I am using the latest 0.8 build from the svn repository. I just recently started using PDFBox, so I am not very familiar with the code. Any information will be helpful. Thanks.
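To illustrate the double encoding mentioned above, here is a minimal Java sketch. It assumes the fault is that UTF-8 bytes get decoded with a single-byte charset (ISO-8859-1 is used here) before being written out as UTF-8 again; that cause is a guess and is not confirmed anywhere in this thread. The class and method names are hypothetical, and only the standard library is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Simulates the suspected fault: the UTF-8 bytes of the text are
    // misread as a single-byte charset, producing a garbled string.
    public static String doubleEncode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    // Reverses the fault: recover the original bytes using the wrong
    // charset, then decode them as UTF-8.
    public static String repair(String garbled, Charset wrong) {
        byte[] original = garbled.getBytes(wrong);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // A six-letter Russian sample word, written with escapes.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        String garbled = doubleEncode(russian, latin1);
        System.out.println(garbled.equals(russian));                 // false
        System.out.println(repair(garbled, latin1).equals(russian)); // true
    }
}
```

ISO-8859-1 is chosen because it maps every byte value to a character, so the round trip loses nothing. With a charset that leaves some byte values undefined, the repair step may not recover the original text.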
 
-Adrian Romano
 
 

RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Thanks!

 

We look forward to hearing from you on this very soon.

 

Thanks for your help.

 

From: Daniel Manzke [mailto:daniel.manzke@googlemail.com] 
Sent: Monday, December 29, 2008 5:36 PM
To: Duseja, Sushil; pdfbox-users@incubator.apache.org
Cc: Rally, Menka
Subject: Re: Garbage Output

 

Sorry, this would be a job for one of the PDFBox developers. So far I have just been doing some support for the list and don't have much know-how about it.

 

So I can just have a look in the evening and maybe I will find a solution. ;)

 

 

Daniel

 

2008/12/29 Duseja, Sushil <su...@fiserv.com>

If possible, can you please let us know your contact number to discuss this issue?

 

Thanks!

 

From: Daniel Manzke [mailto:daniel.manzke@googlemail.com] 
Sent: Monday, December 29, 2008 5:12 PM
To: Duseja, Sushil; pdfbox-users@incubator.apache.org
Cc: Rally, Menka
Subject: Re: Garbage Output

 

Hi,

 

I've just added this line:

 

//after stripper.extractRegions();

stripper.getText(document);

 

After doing this I got some text for the regions, but it seems that this text relates to page 1. Have you found an example of how to use the Stripper? Maybe someone else could help you, since I don't have any knowledge about the Stripper.

 

If I have some time in the evening I will give it another test. 

 

 

Bye,

Daniel

2008/12/29 Duseja, Sushil <su...@fiserv.com>

Hello Daniel,

 

I tried using the compiled version you sent, with no luck.

 

I tried running a Java program (for text extraction) with PDFBox versions 0.7.3 and 0.8 on the classpath separately. With 0.8, I am not able to fetch anything. With 0.7.3, however, I could extract all values apart from "Year of Form", whose value is garbage (À¾´»), which is why you recommended using 0.8.

 

Note: the Java program and my PDF are attached for your reference. The names of the Java files are self-explanatory and indicate which version each one uses. The contents of these Java files are exactly the same.

 

Please advise.

 

Thanks!

 

From: Daniel Manzke [mailto:daniel.manzke@googlemail.com] 
Sent: Monday, December 29, 2008 2:45 PM
To: Duseja, Sushil
Cc: pdfbox-users@incubator.apache.org; Rally, Menka
Subject: Re: Garbage Output

 

Just check out the latest source code and run Maven.

 

I will send you a compiled version.

 

 

Bye

2008/12/29 Duseja, Sushil <su...@fiserv.com>

Thanks Daniel.

 

Do you mean that I need to fetch the latest source code from the trunk of the Subversion repository? If not, how can I get the source code for 0.8?

 

I would really appreciate it if you could build me a compiled version. I hope I am not bothering you.

 

Thanking you in anticipation.

 

From: Daniel Manzke [mailto:daniel.manzke@googlemail.com] 
Sent: Monday, December 29, 2008 1:41 PM
To: Duseja, Sushil
Cc: pdfbox-users@incubator.apache.org; Rally, Menka
Subject: Re: Garbage Output

 

PDFBox is still in incubation and there is no 0.8 distribution yet. What you could do is download the source code and build it on your own. Then you could look at the code and debug where the garbage is produced. Or ask me and I will build you a compiled version.

 

 

Daniel

2008/12/29 Duseja, Sushil <su...@fiserv.com>

Thanks again for responding.

 

Can you please point me to the URL/location from which version 0.8 can be downloaded?

 

I looked at http://sourceforge.net/project/showfiles.php?group_id=78314; however, it shows 0.7.3 as the latest version.

 

Thanks for your time.

 

From: Daniel Manzke [mailto:daniel.manzke@googlemail.com] 
Sent: Monday, December 29, 2008 1:29 PM
To: Duseja, Sushil
Cc: pdfbox-users@incubator.apache.org; Rally, Menka
Subject: Re: Garbage Output

 

Try checking out the latest development build, since 0.7.3 is outdated (it dates from 2006). A lot of issues have been fixed in 0.8.

 

 

Bye,

daniel

2008/12/29 Duseja, Sushil <su...@fiserv.com>

Hello Daniel,

Thanks for the response.

I am using version 0.7.3.

Thanks!


-----Original Message-----
From: Daniel Manzke [mailto:daniel.manzke@googlemail.com]
Sent: Friday, December 26, 2008 9:11 PM
To: pdfbox-users@incubator.apache.org
Subject: Re: Garbage Output

Hi,
Standard question. ;) Which version are you using?


Daniel

2008/12/26 Duseja, Sushil <su...@fiserv.com>

>  Hello,
>
>
>
> While extracting text from a PDF file (attached for your kind reference)
> using PDFBox, I get garbage output (*À¾´»*) for a special text value
> "*2007*" (please see page 2); I can fetch other values correctly though.
>
> Is this an *encoding issue*? If yes, can anyone please let me know how to
> fix it? If possible, please point me to some working examples.
>
>
>
> Thanks in advance.
>



--
Kind regards

Daniel Manzke






RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Hello Daniel,

 

Any luck on this?

 

Thanks!

 


RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Hello Daniel,

 

Any luck with this?

 

Thanks!

 

From: Duseja, Sushil 
Sent: Monday, January 05, 2009 7:28 PM
To: Daniel Manzke
Cc: pdfbox-users@incubator.apache.org; Rally, Menka
Subject: RE: Garbage Output

 

Hello Daniel,

 

The text ("2007") we need to extract is written in the CLRDingbats font. Can you please give us any pointers so that we don't get a garbage value while extracting it from the PDF (attached for your reference)?

 

Thanks!

 


Re: Garbage Output

Posted by Andreas Lehmkühler <an...@lehmi.de>.
> The text (“*2007*”) we need to extract is written in *CLRDingbats* font.
> Can you please give us any pointer so that we don’t get garbage value
> while extracting it from the pdf (attached for your reference)?
> ...
As it is not possible to attach files to messages on this
mailing list, I suggest submitting an issue via JIRA [1] with a brief
description of your problem. Furthermore, it is easier to track problems
using JIRA. Please attach at least one sample document to demonstrate the
problem.

TIA,
Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX


RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Hello Daniel,

 

The text ("2007") we need to extract is written in the CLRDingbats font. Can you please give us any pointers so that we don't get a garbage value while extracting it from the PDF (attached for your reference)?
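One way to narrow this down is to print the code points of the extracted string and compare them with the digits that were expected. The sketch below is only a diagnostic aid. It assumes, without confirmation from this thread, that a symbol font such as CLRDingbats may hand back its own character codes rather than digits. The class name CodePointDump and its dump helper are hypothetical and use only the standard library.

```java
public class CodePointDump {

    // Formats every code point of the text as U+XXXX, separated by spaces.
    public static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The garbage value reported in this thread, written with escapes.
        String garbage = "\u00C0\u00BE\u00B4\u00BB";
        System.out.println(dump(garbage)); // U+00C0 U+00BE U+00B4 U+00BB
        System.out.println(dump("2007"));  // U+0032 U+0030 U+0030 U+0037
    }
}
```

Note that the four garbage characters are all different while "2007" repeats a digit, so a simple one-to-one character substitution would not explain the output.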

 

Thanks!

 


Re: Garbage Output

Posted by Daniel Manzke <da...@googlemail.com>.
Sorry, this would be a job for one of the PDFBox developers. So far I have
just been doing some support for the list and don't have much know-how about
it.
So I can just have a look in the evening and maybe I will find a solution.
;)


Daniel


Re: Garbage Output

Posted by Daniel Manzke <da...@googlemail.com>.
Hi,
I've just added this line:

//after stripper.extractRegions();
stripper.getText(document);

After doing this I got some text for the regions, but it seems that this
text relates to page 1. Have you found an example of how to use the
Stripper? Maybe someone else could help you, since I don't have
any knowledge about the Stripper.

If I have some time in the evening I will give it another test.


Bye,
Daniel

2008/12/29 Duseja, Sushil <su...@fiserv.com>

>  Hello Daniel,
>
>
>
> I tried using the compiled version sent across by you with no luck.
>
>
>
> I tried running a java program (for text extraction) with PDFBox 0.7.3 and
> 0.8 versions in the classpath separately. With 0.8, I am not being able to
> fetch anything. However with 0.7.3, I could extract all values apart from
> "Year of Form"  whose value is garbage - À¾´» , which is why you recommended
> using 0.8.
>
>
>
> Note - Java program and my PDF are attached for your kind reference. The
> names of the java files are self explanatory and indicative of which version
> they are using. The contents of these java files are exactly the same.
>
>
>
> Please advise.
>
>
>
> Thanks!
>
>
>
> *From:* Daniel Manzke [mailto:daniel.manzke@googlemail.com]
> *Sent:* Monday, December 29, 2008 2:45 PM
>
> *To:* Duseja, Sushil
> *Cc:* pdfbox-users@incubator.apache.org; Rally, Menka
> *Subject:* Re: Garbage Output
>
>
>
> Just check out the latest source code and run Maven.
>
>
>
> I will send you a compiled version.
>
>
>
>
>
> Bye
>
> 2008/12/29 Duseja, Sushil <su...@fiserv.com>
>
> Thanks Daniel.
>
>
>
> Do you mean that - I need to fetch the latest source code from the trunk in
> the Subversion repository? If no, how can I get the source code for 0.8?
>
>
>
> I would really appreciate if you can build me a compiled version. I hope I
> am not bothering you.
>
>
>
> Thanking you in anticipation.
>
>
>
> *From:* Daniel Manzke [mailto:daniel.manzke@googlemail.com]
> *Sent:* Monday, December 29, 2008 1:41 PM
>
>
> *To:* Duseja, Sushil
> *Cc:* pdfbox-users@incubator.apache.org; Rally, Menka
> *Subject:* Re: Garbage Output
>
>
>
> PDFBox is still under incubation and there is not 0.8 distribution. What
> you could do, is downloading the source code and build it by your own. So
> you could have a look at the code and debug it, where the garbage is
> produced. Or ask me and I will build you a compiled version.
>
>
>
>
>
> Daniel



-- 
Kind regards

Daniel Manzke

Re: Garbage Output

Posted by Daniel Manzke <da...@googlemail.com>.
Just check out the latest source code and run Maven.
I will send you a compiled version.


Bye




-- 
Kind regards

Daniel Manzke

RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Thanks Daniel.

 

Do you mean that I need to fetch the latest source code from the trunk of the Subversion repository? If not, how can I get the source code for 0.8?

 

I would really appreciate it if you could build me a compiled version. I hope I am not bothering you.

 

Thanking you in anticipation.

 



Re: Garbage Output

Posted by Daniel Manzke <da...@googlemail.com>.
PDFBox is still in incubation and there is no 0.8 distribution yet. What you
could do is download the source code and build it yourself. Then you can
look at the code and debug it to find where the garbage is produced.
Or ask me and I will build you a compiled version.

Daniel




-- 
Kind regards

Daniel Manzke

RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Thanks again for responding.

 

Can you please point me to the URL/location from which the 0.8 version can be downloaded?

 

I looked at http://sourceforge.net/project/showfiles.php?group_id=78314; however, it shows 0.7.3 as the latest version.

 

Thanks for your time.

 



Re: Garbage Output

Posted by Daniel Manzke <da...@googlemail.com>.
Try checking out the latest development build, because 0.7.3 is outdated
(it dates from 2006). A lot of issues have been fixed in 0.8.

Bye,
daniel




-- 
Kind regards

Daniel Manzke

RE: Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Hello Daniel,

Thanks for the response.

I am using version 0.7.3.

Thanks!


Re: Garbage Output

Posted by Daniel Manzke <da...@googlemail.com>.
Hi,
standard question. ;) Which version are you using?


Daniel




-- 
Kind regards

Daniel Manzke

Garbage Output

Posted by "Duseja, Sushil" <su...@fiserv.com>.
Hello,

 

While extracting text from a PDF file (attached for your reference) using PDFBox, I get garbage output (À¾´») for one particular text value, "2007" (please see page 2); other values are extracted correctly.

Is this an encoding issue? If so, can anyone please let me know how to fix it? If possible, please point me to some working examples.

 

Thanks in advance.
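
For anyone reporting a problem like this, it can help to state exactly which characters came out of the extraction, independent of how a console or editor renders them. The following is only a minimal sketch and is not from the thread: `CodePointDump` is a hypothetical helper name, it makes no PDFBox calls, and it assumes the extracted text is already available as a Java string. It only lists code points and UTF-8 bytes; it does not diagnose or fix the cause of the garbage.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: lists the Unicode code points
// and UTF-8 bytes of a string so garbled output can be reported precisely.
final class CodePointDump {

    // Returns the code points of s as "U+XXXX" tokens separated by spaces.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    // Returns the UTF-8 bytes of s as two-digit hex values separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expected = "2007";
        // The garbage from the report, written as escapes so the result
        // does not depend on the encoding of this source file.
        String extracted = "\u00C0\u00BE\u00B4\u00BB";
        System.out.println("expected : " + codePoints(expected));
        System.out.println("extracted: " + codePoints(extracted));
        System.out.println("utf-8    : " + utf8Hex(extracted));
    }
}
```

Run as is, this prints U+0032 U+0030 U+0030 U+0037 for the expected value, U+00C0 U+00BE U+00B4 U+00BB for the garbage, and C3 80 C2 BE C2 B4 C2 BB for its UTF-8 bytes. Including such a listing in a report removes any doubt about whether the garbage was produced during extraction or only when the output was displayed.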