Posted to users@pdfbox.apache.org by Dexter Mishra <de...@gmail.com> on 2009/03/28 16:13:11 UTC
[Some Comments and Ideas] Modified the Parser for some need

Hi All,
I am a fairly new PDFBox user; I started using it two or three weeks ago. We
had a special need: we had a few PDFs (maybe 10-100, depending on the
customer; PDF version 1.3 or 1.4 only) that we wanted to merge, but subject to
a criterion. In our PDFs we sometimes insert user comments as metadata (a
business need).

I will describe the structure of the PDFs first.
*PDF A*
...
% DEX (SECT=1)
263.9999084473 285.001953125 Td
(We choose the)Tj

% DEX (SECT=1)
24 273.0018615723 Td
(below:)Tj
...

*PDF B*
...
% DEX (SECT=1)
263.9999084473 285.001953125 Td
(PDF B)Tj
...

*PDF C*
...
% DEX (SECT=2)
263.9999084473 285.001953125 Td
(PDF C)Tj
% DEX (SECT=1)
263.9999084473 285.001953125 Td
(The above text of PDF C is not choosen)Tj
...


Now we need to construct a PDF D, but under a certain condition: PDF D should
contain only the merged text where SECT=1.
So in the above example the text displayed in PDF D should be "*We
choose the below: PDF B. The above text of PDF C is not choosen*". The text
"PDF C" is left out because the comment preceding it in PDF C says SECT=2.
There can be more conditions like this. For us these comments act as metadata
about the structure and use of the PDF.

But the parser always skips anything that starts with %. So I made some
changes to implement this. I changed the parser so that certain prefixes, such
as % DEX, are not treated as comments. The prefixes can be fed to the parser
through some XML data (you can use any string you want: % DEX, % ME, etc. For
now I set this programmatically; it needs more testing and will become an
extension). I defined a COSDummy which holds the string % DEX and the rest
between the starting "(" and the ending ")", which is parsed as a COSString.
Any comment that does not start with % DEX, or whatever prefix is configured,
is skipped as before and is not stored as a COSDummy. I then added a routine
to the parser which takes a condition like SECT=1 (I plan to turn it into a
proper query later) and extracts only the text that follows.

I also intend to allow Lua (www.lua.org) between the "(" and ")". What will
that give me? For example: if SECT=1, take the data and draw a graph. This
would be done by Lua (Lua is a powerful data-description and scripting
language. It is small and lightweight, and it was also used in World of
Warcraft).
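
To make the selection rule concrete, here is a minimal Java sketch of it. It is not my actual parser change and it does not use any PDFBox API: it assumes the content stream is already decoded to text with one operator per line, as in the examples above, and the class name SectFilterSketch and method extract are made up for illustration. A marker is taken to apply to every (..)Tj line until the next marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: filters show-text strings by marker comment.
final class SectFilterSketch {
    // Simple show-text operator such as "(below:)Tj".
    // Assumption: no escaped or nested parentheses inside the string.
    private static final Pattern SHOW_TEXT = Pattern.compile("\\((.*)\\)\\s*Tj");

    // Builds a pattern for marker comments such as "% DEX (SECT=1)".
    private static Pattern markerPattern(String prefix) {
        return Pattern.compile("%\\s*" + Pattern.quote(prefix) + "\\s*\\((\\w+)=(\\w+)\\)");
    }

    // Returns the Tj strings that follow a marker whose key and value match.
    static List<String> extract(String content, String prefix, String key, String value) {
        Pattern marker = markerPattern(prefix);
        List<String> out = new ArrayList<>();
        boolean selected = false; // true while the most recent marker matched
        for (String raw : content.split("\\R")) {
            String line = raw.trim();
            Matcher m = marker.matcher(line);
            if (m.matches()) {
                selected = m.group(1).equals(key) && m.group(2).equals(value);
                continue;
            }
            Matcher t = SHOW_TEXT.matcher(line);
            if (selected && t.matches()) {
                out.add(t.group(1));
            }
        }
        return out;
    }
}
```

Calling SectFilterSketch.extract(pdfC, "DEX", "SECT", "1") on the PDF C fragment above returns only the second string, because the first one follows a SECT=2 marker. Real content streams are usually compressed and are not guaranteed to put one operator per line, so a line-based scan like this only illustrates the rule.
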
I would like your comments on which parts of the code I should change for
this. I have made the changes, but they only handle simple cases like SECT=1;
if I want to add Lua I need more robust code. So I need your input: where can
I add a callback function so that whenever SECT=1 is hit, the text-processing
operators that follow are called? Any more ideas or use cases?
Please let me know.
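
To show the kind of callback I have in mind, here is a rough Java sketch. Everything in it is hypothetical (MarkerDispatchSketch, on, marker and text are names I made up, not PDFBox classes or methods); the open question is which place in the parser should call marker() and text().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical dispatcher, not part of PDFBox: routes text to handlers
// registered for a condition such as "SECT=1".
final class MarkerDispatchSketch {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();
    private String current = null; // condition set by the most recent marker

    // Registers a handler for a condition written as "KEY=VALUE".
    void on(String condition, Consumer<String> handler) {
        handlers.computeIfAbsent(condition, k -> new ArrayList<>()).add(handler);
    }

    // To be called by the parser when it reads a marker comment.
    void marker(String key, String value) {
        current = key + "=" + value;
    }

    // To be called by the parser for each text string that follows a marker.
    void text(String s) {
        if (current == null) {
            return; // no marker seen yet, nothing to dispatch
        }
        List<Consumer<String>> matching =
                handlers.getOrDefault(current, Collections.<Consumer<String>>emptyList());
        for (Consumer<String> h : matching) {
            h.accept(s);
        }
    }
}
```

A handler registered with on("SECT=1", ...) would then receive only the strings that follow a SECT=1 marker. A Lua script could later sit behind the same kind of handler, but that part is not sketched here.
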