
Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2012/07/19 13:02:59 UTC

ConformingParser (PDFBOX-1000)

Hi there,

Resuming work on PDFBOX-1000, I ran into the question of how to maintain state within the base components PDFLexer and SimpleParser (the latter is yet to come).

For example, to tell a plain number from the start of an indirect object I may have to read three tokens, {num} {gen} obj, and then check whether {num} is a standalone number or begins an indirect object. There are two ways to recover if I have read too many tokens and the number was in fact a standalone object:

a) depend on file position e.g. filePointer and seek
b) maintain some internal state

I currently lean towards b), as this would remove the dependency on filePointer() and seek() or similar methods. The cost is that whenever parsing has to start from a new point within the file, object etc., there needs to be some reset() call to clear the state. Also, the caller, e.g. ConformingParser, has to make sure there is some way to reposition the cursor. On the other hand, not depending on a specific position would allow PDFLexer and SimpleParser to be extended to work on byte[] and similar.
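
To make option b) concrete, here is a minimal sketch of a lexer that keeps already-read tokens in an internal buffer instead of seeking. Everything in it is hypothetical: the class LookaheadLexer, its method names, and the whitespace-only tokenizing are assumptions for illustration (a real PDF lexer also has to handle delimiters, strings and comments), not the planned PDFLexer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of option b): forward-only input plus internal token state. */
class LookaheadLexer {
    private final InputStream in;
    // Tokens already read from the stream but not yet consumed by the caller.
    private final Deque<String> pending = new ArrayDeque<>();

    LookaheadLexer(InputStream in) {
        this.in = in;
    }

    /** Returns the next token, or null at end of input. */
    String next() {
        if (!pending.isEmpty()) {
            return pending.pollFirst();
        }
        return readToken();
    }

    /** Hands a token back so that the following next() returns it again. */
    void pushBack(String token) {
        pending.addFirst(token);
    }

    /** Drops buffered tokens; to be called after the caller repositions the input. */
    void reset() {
        pending.clear();
    }

    /**
     * Call after {num} has been read. Returns true if the next two tokens are
     * "{gen} obj"; otherwise restores them so {num} is treated as a plain number.
     */
    boolean startsIndirectObject() {
        String gen = next();
        String keyword = next();
        boolean indirect = gen != null && gen.matches("\\d+") && "obj".equals(keyword);
        if (!indirect) {
            // Push back in reverse order so the original token order is preserved.
            if (keyword != null) {
                pushBack(keyword);
            }
            if (gen != null) {
                pushBack(gen);
            }
        }
        return indirect;
    }

    // Simplification: tokens are separated by whitespace only.
    private String readToken() {
        try {
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) {
                c = in.read();
            }
            if (c == -1) {
                return null;
            }
            StringBuilder token = new StringBuilder();
            while (c != -1 && !Character.isWhitespace(c)) {
                token.append((char) c);
                c = in.read();
            }
            return token.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape the lexer never asks the input for its position, but any caller that jumps to another offset has to call reset() itself, otherwise stale tokens from the old position would be returned first.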

WDYT

Kind regards

Maruan Sahyoun

Re: ConformingParser (PDFBOX-1000)

Posted by Timo Boehme <ti...@ontochem.com>.

Hi,

On 19.07.2012 13:02, Maruan Sahyoun wrote:
> Resuming work on PDFBOX-1000, I ran into the question of how to maintain state within the base components PDFLexer and SimpleParser (the latter is yet to come).
>
> For example, to tell a plain number from the start of an indirect object I may have to read three tokens, {num} {gen} obj, and then check whether {num} is a standalone number or begins an indirect object. There are two ways to recover if I have read too many tokens and the number was in fact a standalone object:
>
> a) depend on file position e.g. filePointer and seek
> b) maintain some internal state
>
> I currently lean towards b), as this would remove the dependency on filePointer() and seek() or similar methods. The cost is that whenever parsing has to start from a new point within the file, object etc., there needs to be some reset() call to clear the state. Also, the caller, e.g. ConformingParser, has to make sure there is some way to reposition the cursor. On the other hand, not depending on a specific position would allow PDFLexer and SimpleParser to be extended to work on byte[] and similar.
>
> WDYT

Why not use o.a.p.io.RandomAccessRead? This interface can be
implemented for all kinds of input.
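
As a rough illustration of that idea, the sketch below backs a seekable read interface with a byte[] and uses position plus seek to undo the lookahead. The interface SeekableInput is a hypothetical stand-in written for this example, not the actual RandomAccessRead signature, and ByteArrayInput, SeekingParser and the whitespace-only tokenizing are assumptions as well.

```java
/** Hypothetical stand-in for a seekable read interface; not the real PDFBox API. */
interface SeekableInput {
    int read();               // next byte as 0..255, or -1 at end of input
    long getPosition();       // current offset
    void seek(long position); // jump to an absolute offset
}

/** The same interface implemented on top of a plain byte[]. */
class ByteArrayInput implements SeekableInput {
    private final byte[] data;
    private int pos;

    ByteArrayInput(byte[] data) {
        this.data = data;
    }

    public int read() {
        return pos < data.length ? (data[pos++] & 0xFF) : -1;
    }

    public long getPosition() {
        return pos;
    }

    public void seek(long position) {
        if (position < 0 || position > data.length) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        pos = (int) position;
    }
}

class SeekingParser {
    /** Reads one whitespace-separated token, or returns null at end of input. */
    static String nextToken(SeekableInput in) {
        int c = in.read();
        while (c != -1 && Character.isWhitespace(c)) {
            c = in.read();
        }
        if (c == -1) {
            return null;
        }
        StringBuilder token = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c)) {
            token.append((char) c);
            c = in.read();
        }
        return token.toString();
    }

    /** Call after {num} has been read; seeks back if "{gen} obj" does not follow. */
    static boolean startsIndirectObject(SeekableInput in) {
        long mark = in.getPosition();
        String gen = nextToken(in);
        String keyword = nextToken(in);
        boolean indirect = gen != null && gen.matches("\\d+") && "obj".equals(keyword);
        if (!indirect) {
            in.seek(mark); // undo the lookahead by repositioning the input
        }
        return indirect;
    }
}
```

Here the parser holds no token state of its own, so no reset() is needed; the trade-off is that every input source has to support seeking.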


Best regards,

Timo


-- 

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  timo.boehme@ontochem.com

_____________________________________________________________________

  OntoChem GmbH
  Managing Director: Dr. Lutz Weber
  Registered office: Halle / Saale
  Register court: Stendal
  Registration number: HRB 215461
_____________________________________________________________________