Posted to dev@corinthia.apache.org by Peter Kelly <pm...@apache.org> on 2015/01/01 11:58:16 UTC

DFStorage

I realise that I haven’t done a very good job of documenting the code in Corinthia, as you’ve probably noticed :) I’ve been meaning to get around to this for a while now.

There’s an “abstract class” (or, more accurately, interface) called DFStorage which abstracts over different ways of storing “files” (byte arrays/byte streams) - the concrete implementations are 1) memory 2) a directory in the filesystem and 3) a zip file.
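
In C an "interface" like this is usually modelled as a struct of function pointers that each backend fills in when the object is created. The following is just a hypothetical illustration of that pattern, not the actual DFStorage internals (StorageOps, Storage and StorageRead are invented names):

#include <stddef.h>

typedef struct StorageOps {
    int  (*read)(void *impl, const char *path, void **bytes, size_t *len);
    int  (*write)(void *impl, const char *path, const void *bytes, size_t len);
    int  (*save)(void *impl);
    void (*release)(void *impl);
} StorageOps;

typedef struct Storage {
    const StorageOps *ops;  /* "vtable" chosen when the object is created */
    void *impl;             /* backend-specific state: memory map, directory path, or zip handle */
} Storage;

/* Callers go through generic functions like this and never see which backend they got */
static int StorageRead(Storage *st, const char *path, void **bytes, size_t *len)
{
    return st->ops->read(st->impl, path, bytes, len);
}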

The idea with the zip implementation of DFStorage is that you create such an object, and when you read from or write to it, it works directly from the zip file. For example:

DFStorage *st = DFStorageOpenZip("filename.docx",&error);
if (st != NULL) {
    DFBuffer *foo = DFStorageRead(st,…);
    DFStorageWrite(st,…);
    // etc.
    DFStorageSave(st);
    DFBufferRelease(foo);
    DFStorageRelease(st);
}

This way, filter code doesn’t need to care how the data is actually stored.

The current implementation is a very simplistic one: it simply reads the whole zip file into memory. This is largely due to a limitation in the minizip API, which enforces sequential access to the entries in a file. It would be conceivable to have the zip DFStorage implementation first read a directory listing, and then, for each file that’s requested, do a linear scan through the entries until it finds the requested one and read it from there. This would be an O(n) operation in the number of entries, but it would be unlikely to be a major problem since most zip packages we’re dealing with will only have a fairly small number of entries.
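
To make the scan-on-demand idea concrete, here is a rough sketch of how a single lookup could be done with minizip’s unzip API; unzLocateFile() performs the linear scan over the entries for us. This is not the current DFStorage code, and readZipEntry and its error handling are made up for illustration:

#include <stdlib.h>
#include "unzip.h"   /* minizip */

void *readZipEntry(const char *zipPath, const char *entryName, unsigned *lenOut)
{
    unzFile zf = unzOpen(zipPath);
    if (zf == NULL)
        return NULL;

    void *data = NULL;
    /* Linear scan through the entries until the requested name is found */
    if (unzLocateFile(zf,entryName,1) == UNZ_OK) {
        unz_file_info info;
        if ((unzGetCurrentFileInfo(zf,&info,NULL,0,NULL,0,NULL,0) == UNZ_OK) &&
            (unzOpenCurrentFile(zf) == UNZ_OK)) {
            data = malloc(info.uncompressed_size);
            int got = unzReadCurrentFile(zf,data,(unsigned)info.uncompressed_size);
            if ((got < 0) || ((unsigned)got != info.uncompressed_size)) {
                free(data);
                data = NULL;
            }
            else {
                *lenOut = (unsigned)got;
            }
            unzCloseCurrentFile(zf);
        }
    }
    unzClose(zf);
    return data;
}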

Minizip does not provide any way to cache the location in the zip file of a particular entry, even though this information could in theory be obtained (just not through minizip’s API). If I were writing a zip implementation from scratch (and maybe this is something we could consider), I would have it read a list of all entries and remember their locations in a hash table, so that when a particular named entry is requested, we can go directly to that point in the file without having to do a linear scan.
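
For reference, this is roughly what such a from-scratch index would involve: the zip format keeps a central directory at the end of the file whose entries record each file’s name and the offset of its local header. The sketch below is purely illustrative (ZipIndexEntry and zipBuildIndex are invented names, it ignores ZIP64 and most error cases, and a real version would put the entries in a hash table rather than a plain array):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { char *name; uint32_t offset; } ZipIndexEntry;

static uint16_t le16(const unsigned char *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t le32(const unsigned char *p) { return p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24); }

ZipIndexEntry *zipBuildIndex(const char *path, size_t *countOut)
{
    FILE *f = fopen(path,"rb");
    if (f == NULL)
        return NULL;
    fseek(f,0,SEEK_END);
    long fileSize = ftell(f);

    /* The End Of Central Directory record (signature 0x06054b50) lives in the
       last 64KB + 22 bytes of the file; scan backwards for it */
    long tailLen = (fileSize < 65557) ? fileSize : 65557;
    unsigned char *tail = malloc(tailLen);
    fseek(f,fileSize - tailLen,SEEK_SET);
    fread(tail,1,tailLen,f);
    long eocd = -1;
    for (long i = tailLen - 22; i >= 0; i--) {
        if (le32(tail + i) == 0x06054b50) { eocd = i; break; }
    }
    if (eocd < 0) { free(tail); fclose(f); return NULL; }

    uint16_t entryCount = le16(tail + eocd + 10);   /* total number of entries */
    uint32_t cdOffset   = le32(tail + eocd + 16);   /* where the central directory starts */
    free(tail);

    /* Walk the central directory headers (signature 0x02014b50), recording
       each entry's name and the offset of its local header */
    ZipIndexEntry *index = calloc(entryCount,sizeof(ZipIndexEntry));
    fseek(f,(long)cdOffset,SEEK_SET);
    for (size_t i = 0; i < entryCount; i++) {
        unsigned char hdr[46];
        if ((fread(hdr,1,46,f) != 46) || (le32(hdr) != 0x02014b50))
            break;
        uint16_t nameLen    = le16(hdr + 28);
        uint16_t extraLen   = le16(hdr + 30);
        uint16_t commentLen = le16(hdr + 32);
        index[i].offset = le32(hdr + 42);           /* local header offset */
        index[i].name = malloc(nameLen + 1);
        fread(index[i].name,1,nameLen,f);
        index[i].name[nameLen] = '\0';
        fseek(f,extraLen + commentLen,SEEK_CUR);    /* skip extra field and comment */
    }
    fclose(f);
    *countOut = entryCount;
    return index;
}

With an index like this in place, reading a named entry becomes a direct seek to its recorded offset rather than a scan through the archive.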

Writing to zip files is another inconvenient thing, because it’s really only possible to do it in an append-only manner. If a large image is deleted from a document, or replaced with a modified image, we don’t want to keep the old one around; so instead we create an entirely new zip file and overwrite the old one. As with reading, the current implementation stores all the content in memory and then writes it out to disk in one go when you call DFStorageSave(). However, for documents containing large images this may mean unacceptable amounts of memory usage, depending on the application/environment.
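
For what it’s worth, a save along those lines could look something like the following, using minizip’s zip-writing API; savePackage and writeEntry are hypothetical names and the loop over the in-memory entries is elided:

#include <stdio.h>
#include "zip.h"    /* minizip */
#include "zlib.h"   /* Z_DEFLATED, Z_DEFAULT_COMPRESSION */

static int writeEntry(zipFile zf, const char *name, const void *data, unsigned len)
{
    if (zipOpenNewFileInZip(zf,name,NULL,NULL,0,NULL,0,NULL,
                            Z_DEFLATED,Z_DEFAULT_COMPRESSION) != ZIP_OK)
        return 0;
    int ok = (zipWriteInFileInZip(zf,data,len) == ZIP_OK);
    zipCloseFileInZip(zf);
    return ok;
}

int savePackage(const char *path)
{
    /* Write to a temporary file first, so a failed save never clobbers the
       original document */
    char tmpPath[4096];
    snprintf(tmpPath,sizeof(tmpPath),"%s.tmp",path);

    zipFile zf = zipOpen(tmpPath,APPEND_STATUS_CREATE);
    if (zf == NULL)
        return 0;

    /* for each entry held by the storage object:
           writeEntry(zf,name,data,len); */

    if (zipClose(zf,NULL) != ZIP_OK)
        return 0;
    return rename(tmpPath,path) == 0;   /* replace the old zip with the new one */
}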

Coming back to what I said in my previous response to Jan about having multiple zip files open at a time: it’s not done at the moment within the context of a single conversion (but it can happen across multiple threads if there are several conversions going on at the same time). However, if we were to adopt the above approach to limit the memory usage of the zip-based DFStorage objects, and we were converting, say, directly from one zip-based file format to another (think OOXML to ODF), this would require the ability to have multiple zip files open at the same time, in the same thread.
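
As a rough sketch of what that would mean in practice (again illustrative only; convertPackage is a made-up name and the actual per-entry conversion is elided), the source and destination packages would simply both be held open by the same thread:

#include "unzip.h"
#include "zip.h"

int convertPackage(const char *inPath, const char *outPath)
{
    unzFile in  = unzOpen(inPath);                         /* e.g. the .docx being read */
    zipFile out = zipOpen(outPath,APPEND_STATUS_CREATE);   /* e.g. the .odt being written */
    if ((in == NULL) || (out == NULL)) {
        if (in != NULL) unzClose(in);
        if (out != NULL) zipClose(out,NULL);
        return 0;
    }

    /* ... read each entry from 'in', convert it, write the result to 'out' ... */

    unzClose(in);
    zipClose(out,NULL);
    return 1;
}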

On the question of providing our own versions of the APIs for external libraries, I guess I can sort of see some benefit there, in the sense that we can, in theory at least, swap in a different implementation. But this is only possible if the other implementation works the same semantically, with only a different syntax. For example, the use of wrapper functions in and of itself does not really help much IMHO, as the way in which a given piece of functionality is exposed may differ between libraries. I’m not opposed to wrapper functions as in Jan’s branch as such, but it’s just some food for thought.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)


RE: DFStorage

Posted by "Dennis E. Hamilton" <de...@acm.org>.
I have a question that may just be one of nomenclature, ...

 -- replying below to --
From: Peter Kelly [mailto:pmkelly@apache.org] 
Sent: Thursday, January 1, 2015 02:58
To: dev@corinthia.incubator.apache.org
Subject: DFStorage

I realise that I haven’t done a very good job of documenting the code in Corinthia, as you’ve probably noticed :) I’ve been meaning to get around to this for a while now.

[ ... ]

The current implementation is a very simplistic one: it simply reads the whole zip file into memory. This is largely due to a limitation in the minizip API, which enforces sequential access to the entries in a file. It would be conceivable to have the zip DFStorage implementation first read a directory listing, and then, for each file that’s requested, do a linear scan through the entries until it finds the requested one and read it from there. This would be an O(n) operation in the number of entries, but it would be unlikely to be a major problem since most zip packages we’re dealing with will only have a fairly small number of entries.

Minizip does not provide any way to cache the location in the zip file of a particular entry, even though this information could in theory be obtained (just not through minizip’s API). If I were writing a zip implementation from scratch (and maybe this is something we could consider), I would have it read a list of all entries and remember their locations in a hash table, so that when a particular named entry is requested, we can go directly to that point in the file without having to do a linear scan.

[ ... ]

<orcmid>
   @Peter, I want to verify that we have the same understanding of the Zip file.

   The Zip file itself has a global directory to all of the component files at the end of the file.  The global directory provides offsets to where each component file begins in the Zip stream and also provides other pertinent information.

   To produce a Zip file, minizip would need to remember all of this to append to the stream once all of the part files are written out.

   The global directory could certainly be cached and, if necessary, indexed from a hash table on the names of the component parts.

   Without looking at minizip, I would assume that there has to be some internal representation of the global directory even if it is not exposed.  Would it be useful to exploit that somehow to surface a better API?

   So long as the Zip stream can be read via random access, it is normal to access the global directory first and then access the parts based on the global directory, even if access is in sequential order of those parts in the stream.  That helps detect apparent corruption of the Zip and it is essential when the header for a component file does not specify the length of the file data.

   Does this square with your understanding of what is involved in minizip operation?
</orcmid>




Re: DFStorage

Posted by jan i <ja...@apache.org>.
On Thursday, January 1, 2015, Peter Kelly <pm...@apache.org> wrote:

> [ ... ]
>
> On the question of providing our own versions of the APIs for external
> libraries, I guess I can sort of see some benefit there, in the sense that
> we can, in theory at least, swap in a different implementation. But this
> is only possible if the other implementation works the same semantically,
> with only a different syntax. For example, the use of wrapper functions in
> and of itself does not really help much IMHO, as the way in which a given
> piece of functionality is exposed may differ between libraries. I’m not
> opposed to wrapper functions as in Jan’s branch as such, but it’s just
> some food for thought.

I agree that it is food for thought. My idea with the API is a first step
in providing what you suggest: with the API in place, I can work in the
platform layer to first remove minizip (e.g. use an external, non-source
library) and then add the scanning functionality we need.

I just want a clear separation between platform/library code and
application code. BUT we should only make platform APIs where it makes
sense.
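
Purely as an illustration of the kind of separation I mean (this is not the actual code in my branch, and all the names are invented), application code would only ever include a small platform header like this, and the platform layer behind it could be backed by minizip today or something else later:

/* platform/zip.h -- hypothetical sketch */
#ifndef PLATFORM_ZIP_H
#define PLATFORM_ZIP_H

#include <stddef.h>

typedef struct PlatformZip PlatformZip;   /* opaque to application code */

PlatformZip *PlatformZipOpen(const char *path);
int          PlatformZipReadEntry(PlatformZip *zip, const char *name,
                                  void **bytes, size_t *len);
void         PlatformZipClose(PlatformZip *zip);

#endif /* PLATFORM_ZIP_H */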

rgds
jan i




-- 
Sent from My iPad, sorry for any misspellings.