Posted to users@pdfbox.apache.org by "Hawkins, Thomas A. - Student" <th...@midway.edu> on 2012/05/18 07:21:23 UTC

PDFBox and superscript format .NET

I am using the .NET version of PDFBox and I have a PDF that contains data such as this:

Name               Location
Jim Daviees        85
Herschel Walker    96
Vince Gogh         47
Andrew Lincoln     104

I need both the name value and the location value. When I use the following code:

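    ' Load the PDF and extract all of its text with the default stripper.
    Dim p As PDDocument = PDDocument.load(fi.FullName)
    Dim r As PDFTextStripper = New PDFTextStripper

    Dim stringVal As String = r.getText(p)
    ' Convert the extracted text to bytes for writing to the output file.
    Dim bytes As Byte() = System.Text.Encoding.ASCII.GetBytes(stringVal)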

I get the following in the .txt file (also in HTML when I've converted it to that):
Jim Daviees
Herschel Walker
Vince Gogh
Andrew Lincoln
85
96
47
104

I'm okay with the layout, as I've got a workaround for that. My problem is that it destroys any mention of the superscript exponents. Is there a way I can locate these superscript parts and wrap them in brackets or something similar, so that the returned value looks more like this:
Jim Daviees
Herschel Walker
Vince Gogh
Andrew Lincoln
8[5]
9[6]
4[7]
10[4]

So, nutshell time: can I use PDFBox (.NET version) to locate the instances of superscript in a PDF file (like locating <sup></sup> in HTML) and swap them for an easily recognized symbol in my destination file? I picked brackets because I have no brackets in my source file whatsoever, so they would be very easy for me to code around. Thanks in advance.

Re: PDFBox and superscript format .NET

Posted by Ian Holsman <kr...@gmail.com>.

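No idea about examples.

Look at overriding endPage() in a PDFTextStripper subclass and doing something like:

for (List<TextPosition> aCharactersByArticle : charactersByArticle) {
    for (TextPosition t : aCharactersByArticle) {
        // inspect t.getY() and t.getFontSize() here to spot raised, smaller glyphs
    }
}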
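
To make the idea inside that loop concrete, here is a minimal sketch of one possible heuristic. It does not use PDFBox at all: Glyph is a hypothetical stand-in for the few TextPosition values that matter (text, y, font size), and SuperscriptMarker.mark is an illustrative name, not a library call. It assumes y grows downward, as the stripper's adjusted coordinates do, and it treats a glyph as superscript when it is both smaller and higher than the last normal glyph. Real PDFs may need tolerances or a different rule.

```java
import java.util.List;

/** Hypothetical stand-in for the TextPosition values used by the heuristic. */
final class Glyph {
    final String text;     // the character(s) drawn
    final float y;         // baseline y from the top of the page (smaller = higher)
    final float fontSize;  // font size in points

    Glyph(String text, float y, float fontSize) {
        this.text = text;
        this.y = y;
        this.fontSize = fontSize;
    }
}

final class SuperscriptMarker {
    /**
     * Joins the glyphs of one line, wrapping superscript runs in [ ].
     * Heuristic (an assumption, not PDFBox behaviour): a glyph counts as
     * superscript when it is smaller than the last normal glyph AND
     * sits higher on the page.
     */
    static String mark(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph base = null;     // last glyph treated as normal text
        boolean open = false;  // true while inside a [ ... ] run
        for (Glyph g : line) {
            boolean sup = base != null
                    && g.fontSize < base.fontSize
                    && g.y < base.y;
            if (sup && !open) {
                out.append('[');
                open = true;
            }
            if (!sup && open) {
                out.append(']');
                open = false;
            }
            if (!sup) {
                base = g;
            }
            out.append(g.text);
        }
        if (open) {
            out.append(']');
        }
        return out.toString();
    }
}
```

In a real stripper subclass, the Glyph values would come from each TextPosition in the loop above, and the glyphs would first have to be grouped into lines.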

On May 19, 2012, at 3:54 AM, Hawkins, Thomas A. - Student wrote:

> Any idea where I might go for some examples of the TextPosition class? I've searched the docs and found nothing. Looking over the old threads, I've only found people with issues regarding TextPosition. This sounds perfect for what I need; I just need to figure out how to use it (i.e. get the x,y values and iterate through them).
> 
> Thank you.

RE: PDFBox and superscript format .NET

Posted by "Hawkins, Thomas A. - Student" <th...@midway.edu>.

Any idea where I might go for some examples of the TextPosition class? I've searched the docs and found nothing. Looking over the old threads, I've only found people with issues regarding TextPosition. This sounds perfect for what I need; I just need to figure out how to use it (i.e. get the x,y values and iterate through them).

Thank you.
________________________________________
From: Ian Holsman [kryton@gmail.com]
Sent: Friday, May 18, 2012 3:46 AM
To: users@pdfbox.apache.org
Cc: users@pdfbox.apache.org
Subject: Re: PDFBox and superscript format .NET

You might want to look at the processOperator function and watch for the Tj and Ts operators. Ts sets the text rise, which is one way super/subscripts are produced, so it might give you the information you need. If you track the TextPosition class, it should give you the x,y position of the lettering.
Sadly it's harder than it sounds :(
(I'm a newbie so I might be completely off base)


Re: PDFBox and superscript format .NET

Posted by Ian Holsman <kr...@gmail.com>.

You might want to look at the processOperator function and watch for the Tj and Ts operators. Ts sets the text rise, which is one way super/subscripts are produced, so it might give you the information you need. If you track the TextPosition class, it should give you the x,y position of the lettering.
Sadly it's harder than it sounds :(
(I'm a newbie so I might be completely off base)
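
For anyone unsure what watching for Tj and Ts would involve, here is a toy sketch. It is not PDFBox code: TextRiseScanner.scan is a hypothetical helper that walks a simplified, space-separated content stream by hand, tracks the current text rise set by Ts, and brackets any string shown by Tj while the rise is positive. It assumes literal strings with no spaces, escapes, or nested parentheses. Many PDFs never use Ts and instead move the text position and shrink the font, in which case this approach finds nothing.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TextRiseScanner {
    /**
     * Returns the text shown by Tj operators, wrapping anything drawn
     * while the text rise (set by Ts) is positive in [ ].
     * Toy parser: tokens must be separated by whitespace.
     */
    static String scan(String content) {
        StringBuilder out = new StringBuilder();
        Deque<String> operands = new ArrayDeque<>(); // most recent operand first
        double rise = 0;                             // current text rise
        for (String tok : content.trim().split("\\s+")) {
            switch (tok) {
                case "Ts": {  // "<rise> Ts" sets the text rise
                    rise = Double.parseDouble(operands.pop());
                    operands.clear();
                    break;
                }
                case "Tj": {  // "(<string>) Tj" shows a string
                    String s = operands.pop();
                    String text = s.substring(1, s.length() - 1); // strip ( )
                    out.append(rise > 0 ? "[" + text + "]" : text);
                    operands.clear();
                    break;
                }
                case "BT":
                case "ET":
                case "Tf":    // operators this sketch ignores
                    operands.clear();
                    break;
                default:      // anything else is treated as an operand
                    operands.push(tok);
            }
        }
        return out.toString();
    }
}
```

For example, scanning "BT /F1 12 Tf (8) Tj 5 Ts /F1 8 Tf (5) Tj 0 Ts ET" yields 8[5], because the second string is shown while the rise is 5.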


On 18/05/2012, at 3:37 PM, "Hawkins, Thomas A. - Student" <th...@midway.edu> wrote:

> As an addendum, I didn't realize this when I sent the message: the numbers are a combination of regular and superscript digits. Since email won't render superscripts, I'll use the caret operator instead. The numbers should be:
> 8^5       (INSTEAD OF 85)
> 9^6       (INSTEAD OF 96)
> 4^7       (INSTEAD OF 47)
> 10^4     (INSTEAD OF 104)

RE: PDFBox and superscript format .NET

Posted by "Hawkins, Thomas A. - Student" <th...@midway.edu>.

As an addendum, I didn't realize this when I sent the message: the numbers are a combination of regular and superscript digits. Since email won't render superscripts, I'll use the caret operator instead. The numbers should be:
8^5       (INSTEAD OF 85)
9^6       (INSTEAD OF 96)
4^7       (INSTEAD OF 47)
10^4     (INSTEAD OF 104)