Posted to users@pdfbox.apache.org by "Hawkins, Thomas A. - Student" <th...@midway.edu> on 2012/05/20 01:31:29 UTC

PDFBox View Post-Read, Pre-Conversion Stream

I've asked this question a couple of times and I really need help - no one has really given me any type of answer that I can use. I've had answers but they point me in no positive direction.



I am converting pdf files to txt files (of course I lose the formatting), but I get horrible results converting to html and even worse to XML.



So what I want is for the program to either separate superscript exponents with a space or place them in brackets.



Is there any way for me to access the stream of data after the pdf is read, but before it is converted to a string? If I can find a way to do this, then I can figure out how to edit the data to return the txt file I want.



I am using the .NET port of pdfBox and I would appreciate some examples (preferably VB or C#) but Java was my first language and I'm sure I can knock the dust off of my knowledge.

RE: PDFBox View Post-Read, Pre-Conversion Stream

Posted by "Hawkins, Thomas A. - Student" <th...@midway.edu>.

I suppose I should clarify: I was not bemoaning the lack of support or generosity. I was stating the situation as it had occurred to that point. One gentleman had offered support and, through no fault of his own, added more questions than answers. I did not mean to come off as impatient; I am always thankful and never question the delivery of free knowledge.
________________________________________
From: Andreas Lehmkuehler [andreas@lehmi.de]
Sent: Sunday, May 20, 2012 7:02 AM
To: users@pdfbox.apache.org
Subject: Re: PDFBox View Post-Read, Pre-Conversion Stream

Hi,

On 20.05.2012 01:31, Hawkins, Thomas A. - Student wrote:
> I've asked this question a couple of times and I really need help - no one has really
 > given me any type of answer that I can use. I've had answers but they
 > point me in no positive direction.
No offense, but you have to be more patient, we are all volunteers ...

> I am converting pdf files to txt files (of course I lose the formatting),
 > but I get horrible results converting to html and even worse to XML.
>
> So what I want to do, is have the program either place a space between
 > superscript exponents, or, place exponents in brackets.
>
> Is there any way for me to access the stream of data after the pdf is read,
 > but before it is converted to a string? If I can find a way to do this,
 > then I can figure out how to edit the data to return the txt file I want.
It is not that easy.

- the information you are looking for is part of the so-called content stream
- that stream is processed within PDFStreamEngine#processStream [1]
- the main text processing is done in PDFStreamEngine#processEncodedText
- the PDF-operator -> ProcessOperator mapping can be found here [2]
- the class TextPosition doesn't have any information about text features like
superscript
- you might have a look at the pdf specs [3]
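
Since TextPosition carries no superscript flag, one possible workaround is to infer superscripts from geometry: a glyph that is noticeably smaller and sits higher than the preceding baseline glyph is probably an exponent. The following is only a minimal sketch of that heuristic in plain Java, not PDFBox code. The Glyph class, the bracket method and both thresholds are hypothetical names and values chosen for illustration, and the sketch assumes y coordinates that grow downward from the top of the page. In a real program the three fields would have to be filled from whatever per-glyph data the extractor exposes (for example from TextPosition objects collected in a PDFTextStripper subclass), which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

public class SuperscriptBracketer {

    /** Hypothetical stand-in for the per-glyph data a text extractor provides. */
    public static final class Glyph {
        public final String text;
        public final float baselineY; // assumed: measured from the page top, smaller = higher
        public final float fontSize;  // in points

        public Glyph(String text, float baselineY, float fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /**
     * Joins the glyphs of one line into a string, wrapping each run of
     * suspected superscripts in square brackets. Both thresholds are guesses
     * and would need tuning against real documents.
     */
    public static String bracket(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph base = null;        // most recent glyph judged to be on the normal baseline
        boolean inSuper = false;  // true while inside an open bracket

        for (Glyph g : line) {
            boolean isSuper = base != null
                    && g.fontSize < base.fontSize * 0.9f                     // clearly smaller
                    && g.baselineY < base.baselineY - base.fontSize * 0.1f;  // clearly raised

            if (isSuper && !inSuper) {
                out.append('[');
            } else if (!isSuper && inSuper) {
                out.append(']');
            }
            if (!isSuper) {
                base = g;
            }
            inSuper = isSuper;
            out.append(g.text);
        }
        if (inSuper) {
            out.append(']'); // close a superscript run that ends the line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("x", 100f, 12f),
                new Glyph("2", 95f, 8f),   // smaller and raised: treated as an exponent
                new Glyph(" ", 100f, 12f),
                new Glyph("+", 100f, 12f),
                new Glyph(" ", 100f, 12f),
                new Glyph("y", 100f, 12f));
        System.out.println(bracket(line)); // prints "x[2] + y"
    }
}
```

Comparing against the last baseline glyph rather than the immediately preceding glyph keeps multi-digit exponents inside a single pair of brackets. The heuristic will misfire on footnote markers and on lines that mix font sizes, so it is a starting point only.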


> I am using the .NET port of pdfBox and I would appreciate some
 > examples (preferably VB or C#) but Java was my first language and
 > I'm sure I can knock the dust off of my knowledge.
Since this is complicated enough to implement in Java, I doubt
there are any existing approaches in VB or C#.

BR
Andreas Lehmkühler

[1] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFStreamEngine.java
[2] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/PDFTextStripper.properties
[3] http://www.adobe.com/de/devnet/pdf/pdf_reference.html
