Posted to dev@pdfbox.apache.org by Nikhil Varma <pn...@gmail.com> on 2018/04/27 14:18:29 UTC

Where/When does charactersByArticle array get populated.

Hello,

I've been using PDFBox for quite some time now, and I am very happy with the
flexibility and functionality it gives me for processing PDF documents.

Recently I decided to give back to the community, and in the process I am
trying to reverse engineer the library in order to understand how the flow
works. One thing I am stuck on is how and when the TextPosition objects in
the charactersByArticle array are populated and appended. I see that it is
simply checked for content and iterated over in the writePage()
<https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L475>
function of the PDFTextStripper class, but I was unable to figure out how and
when this array is populated with character values.

If someone can brief me on the flow and how this is done, it would be very
helpful.

Re: Where/When does charactersByArticle array get populated.

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 27.04.2018 um 16:18 schrieb Nikhil Varma:
> Hello,
>
> I've been using PDFBox for quite some time now. I am very happy with the
> flexibility and functionality it gave me to process pdf documents.
>
> Recently I decided to give back to the community, and in the process I am
> trying to reverse engineer the library in order to understand how the flow
> works. One thing I am stuck on is how and when the TextPosition objects in
> the charactersByArticle array are populated and appended. I see that it is
> simply checked for content and iterated over in the writePage()
> <https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L475>
> function of the PDFTextStripper class, but I was unable to figure out how and
> when this array is populated with character values.

It's a list of lists (usually holding only one inner list; see the comment
"Most PDFs won't have any beads, so charactersByArticle will contain a
single entry."). The inner lists are created with

charactersByArticle.add(new ArrayList<TextPosition>());

....

List<TextPosition> textList = charactersByArticle.get(articleDivisionIndex);

and later, you'll see

textList.add(text);
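
As a rough, self-contained illustration of that pattern (not the actual PDFTextStripper code): the outer list gets one empty inner list per article section when a page starts, and every glyph is then appended to the inner list selected by its article index. In PDFTextStripper these two steps appear to live in processPage() and processTextPosition(), but please check that against the linked source. The names ArticleBuckets, Glyph, startPage and processGlyph below are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class ArticleBuckets {
    /** Hypothetical stand-in for PDFBox's TextPosition; holds only the text. */
    static final class Glyph {
        final String unicode;
        Glyph(String unicode) { this.unicode = unicode; }
    }

    // One inner list per article section; usually just a single entry.
    private final List<List<Glyph>> charactersByArticle = new ArrayList<>();

    /** Assumed to run once per page: create the empty inner lists. */
    void startPage(int numberOfArticleSections) {
        charactersByArticle.clear();
        for (int i = 0; i < numberOfArticleSections; i++) {
            charactersByArticle.add(new ArrayList<>());
        }
    }

    /** Assumed to run once per glyph while the page content is processed. */
    void processGlyph(Glyph text, int articleDivisionIndex) {
        List<Glyph> textList = charactersByArticle.get(articleDivisionIndex);
        textList.add(text);
    }

    /** Only reads the lists, like the iteration the question describes. */
    String writePage() {
        StringBuilder sb = new StringBuilder();
        for (List<Glyph> article : charactersByArticle) {
            for (Glyph g : article) {
                sb.append(g.unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ArticleBuckets stripper = new ArticleBuckets();
        stripper.startPage(1);                    // no beads: a single bucket
        stripper.processGlyph(new Glyph("H"), 0);
        stripper.processGlyph(new Glyph("i"), 0);
        System.out.println(stripper.writePage()); // prints "Hi"
    }
}
```

The point of the sketch is only that writePage() never fills the lists itself; by the time it runs, the per-glyph callback has already done so.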


>
> If someone can brief me on the flow and how this is done, it would be very
> helpful.
>

To be honest, I barely understand what's being done, LOL. I did some 
work there, but never touched the core algorithm.


If you want to make any changes, post here before doing too much work; I
have more tests than those in the repository. There are some nasty
corner cases where we can't put the files online for copyright reasons.

Tilman


