Posted to dev@pdfbox.apache.org by Timo Boehme <ti...@ontochem.com> on 2012/07/19 13:00:44 UTC

Object scanning (was: Re: Apache PDFBox July 2012 board report due)

Hi

On 19.07.2012 10:03, Maruan Sahyoun wrote:

> maybe we can join forces here as I'm currently working on an Xref
> class which parses xref tables and xref streams. One method should
> also do the mentioned scanning.

Sure. I haven't started yet, so we can discuss the details. What I had
in mind was a fast scan of line starts for object start, endobj and
endstream markers. With this we can detect a missing endobj/endstream etc.
Furthermore, we can correct xref entries, which are sometimes a few bytes
off. Embedded PDFs that are not additionally encoded can cause some
trouble here, but as long as the embedding object and the embedded PDF
are correct this can be handled. Besides, this method is only needed for
broken PDFs, and most of them won't contain such embedded PDFs.
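
To make the idea concrete, here is a minimal sketch of such a scan in Java. It is only an illustration, not PDFBox code: the class ObjectScanSketch, the Marker type and the methods scan and nearestObjStart are hypothetical names made up for this example. It assumes the whole file fits into a byte array and that every marker begins exactly at a line start (no leading whitespace).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: collects "N G obj", "endobj" and "endstream" markers at line starts.
final class ObjectScanSketch {

    /** One marker found at a line start. objNumber is -1 for endobj/endstream. */
    static final class Marker {
        final String kind;
        final int offset;
        final int objNumber;

        Marker(String kind, int offset, int objNumber) {
            this.kind = kind;
            this.offset = offset;
            this.objNumber = objNumber;
        }
    }

    // Object number, space, generation number, space, "obj" (digit counts bounded to avoid overflow).
    private static final Pattern OBJ_START = Pattern.compile("(\\d{1,9}) (\\d{1,5}) obj\\b");

    /** Scans the bytes line by line (CR, LF or CRLF) and records markers with their byte offsets. */
    static List<Marker> scan(byte[] pdf) {
        List<Marker> found = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= pdf.length; i++) {
            boolean eol = i == pdf.length || pdf[i] == '\n' || pdf[i] == '\r';
            if (!eol) {
                continue;
            }
            if (i > lineStart) {
                // ISO-8859-1 maps every byte to one char, so char index equals byte offset.
                String line = new String(pdf, lineStart, i - lineStart, StandardCharsets.ISO_8859_1);
                Matcher m = OBJ_START.matcher(line);
                if (m.lookingAt()) {
                    found.add(new Marker("obj", lineStart, Integer.parseInt(m.group(1))));
                } else if (line.startsWith("endobj")) {
                    found.add(new Marker("endobj", lineStart, -1));
                } else if (line.startsWith("endstream")) {
                    found.add(new Marker("endstream", lineStart, -1));
                }
            }
            lineStart = i + 1;
        }
        return found;
    }

    /** Returns the scanned start of the given object closest to the xref offset, or -1 if none. */
    static int nearestObjStart(List<Marker> markers, int objNumber, int claimedOffset) {
        int best = -1;
        for (Marker m : markers) {
            if (!m.kind.equals("obj") || m.objNumber != objNumber) {
                continue;
            }
            if (best < 0 || Math.abs(m.offset - claimedOffset) < Math.abs(best - claimedOffset)) {
                best = m.offset;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Object 2 deliberately lacks its endobj.
        byte[] sample = "1 0 obj\n<< >>\nendobj\n2 0 obj\nstream\nabc\nendstream\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        for (Marker m : scan(sample)) {
            System.out.println(m.kind + " @ " + m.offset);
        }
    }
}
```

A missing endobj then shows up as two consecutive "obj" markers, or an "endstream" followed directly by "obj", and an xref entry that is a few bytes off can be replaced by the result of nearestObjStart.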


Kind regards,

Timo


> On 19.07.2012 at 09:42, "Andreas Lehmkühler"<an...@lehmi.de> wrote:
>> Timo Boehme<ti...@ontochem.com> wrote on 16 July 2012 at 18:02:
>>> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
>>>> On 10.07.2012 09:16, Timo Boehme wrote:
>>>>> ...
>>> Next, I plan to improve the parser's robustness against broken
>>> documents by doing a first scan over the document (in case of parsing
>>> failure), collecting object start/end points and using them to repair
>>> the xref table.
>>
>> Seems to be necessary, at least for some PDFs. :-(
>>
>>> Another task I would like to do is reducing the amount of memory needed
>>> by using the existing file as input stream resource instead of copying
>>> an object stream first to a temporary buffer (in cases where an input
>>> file exists).
>>> Maybe for this we should switch from assuming an input stream to
>>> assuming an input file, and if we only have an input stream, a
>>> temporary file is created on the fly. WDYT?
>>
>> I guess internally we have to use something abstract, and as everything is a
>> stream, that might be a good choice. AFAIU the current implementation, one
>> reason for using a temporary buffer is that the data is modified
>> (decompressed, decrypted) and we must not alter the input data. It is perhaps
>> a better idea to somehow split the inputstream and the unfilteredinputstream,
>> e.g. read from the inputstream every time an object is dereferenced and store
>> the (decompressed) data in the corresponding object.
>>
>>>
>>>
>>> Kind regards,
>>> Timo
>>
>>
>> BR
>> Andreas Lehmkühler
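
The temporary-file idea quoted above could look roughly like the following Java sketch. It is an assumption about one possible shape, not the PDFBox implementation: FileBackedSourceSketch, toFile and readRange are hypothetical names. The stream is spooled to a temporary file once, and raw byte ranges are then read on demand without ever modifying the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: treat every input as a file, spooling streams to a temp file.
final class FileBackedSourceSketch {

    /** Copies the stream to a temporary file that is deleted when the JVM exits. */
    static Path toFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("pdf-input-", ".tmp");
            tmp.toFile().deleteOnExit();
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads length raw bytes starting at offset; the file itself is opened read-only. */
    static byte[] readRange(Path file, long offset, int length) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Decompression or decryption would then operate on the returned copy, so the input data stays untouched, which matches the concern about not altering the input.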


-- 

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  timo.boehme@ontochem.com

_____________________________________________________________________

  OntoChem GmbH
  Managing director: Dr. Lutz Weber
  Registered office: Halle / Saale
  Register court: Stendal
  Registration number: HRB 215461
_____________________________________________________________________