Posted to users@jackrabbit.apache.org by jd...@21technologies.com on 2007/01/08 21:49:09 UTC

Unnecessary downloading of blob data

Hi,
I recently extended the DatabasePersistenceManager to create a 
PostgresqlPersistenceManager that takes advantage of PostgreSQL's 
LargeObject API.  This was done primarily by extending DbBLOBStore and 
overriding its methods so that they store an oid instead of a bytea 
column for BINVAL_DATA, and then using the LargeObject API to actually 
store the binary data based on that oid.  My new PostgreSQL persistence 
manager successfully removed the memory problems associated with using 
the bytea data type to store large blobs (I can now store a file of any 
size with the default VM heap settings).
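
(For reference, the LargeObject handling boils down to something like the 
sketch below.  The helper class and method names are illustrative only, not 
the actual DbBLOBStore overrides; it uses the createLO/open calls of current 
pgjdbc drivers and assumes autocommit is disabled on the connection, since 
PostgreSQL large objects must be accessed inside a transaction.)

    // Illustrative sketch only -- simplified helper, not the real DbBLOBStore
    // override.  Assumes autocommit is disabled on the connection.
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.sql.Connection;

    import org.postgresql.PGConnection;
    import org.postgresql.largeobject.LargeObject;
    import org.postgresql.largeobject.LargeObjectManager;

    public class LargeObjectBlobHelper {

        /** Streams the blob into a new large object and returns its oid. */
        public static long storeBlob(Connection con, InputStream in) throws Exception {
            LargeObjectManager lom = con.unwrap(PGConnection.class).getLargeObjectAPI();
            long oid = lom.createLO(LargeObjectManager.READ | LargeObjectManager.WRITE);
            LargeObject lo = lom.open(oid, LargeObjectManager.WRITE);
            try {
                OutputStream out = lo.getOutputStream();
                byte[] buf = new byte[8192];
                int len;
                while ((len = in.read(buf)) != -1) {
                    out.write(buf, 0, len);
                }
                out.flush();
            } finally {
                lo.close();
            }
            // the returned oid is what gets stored in the BINVAL_DATA column
            return oid;
        }

        /** Opens the large object identified by oid for streaming reads. */
        public static InputStream readBlob(Connection con, long oid) throws Exception {
            LargeObjectManager lom = con.unwrap(PGConnection.class).getLargeObjectAPI();
            LargeObject lo = lom.open(oid, LargeObjectManager.READ);
            return lo.getInputStream();
        }
    }

The persistence manager override then only has to keep the returned oid in 
the BINVAL_DATA column and delegate the actual byte streaming to helpers 
like these.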

Although this fixed the memory problems I was working on earlier, I 
noticed that performance still wasn't as fast as I would have hoped.  
Stepping through the code, I realized that every time Serializer.serialize 
or Serializer.deserialize is called for a PropertyState, any binary data 
stored in a blob gets downloaded to a temporary file as the InternalValues 
are assembled.  For a large file, this may take 30 seconds.  As a result, 
simple operations like listing child nodes cause the entire blob to be 
downloaded from the PostgreSQL database and written to a temporary file, 
which seems unnecessary since in most of these situations all I really 
care about are the names of the nodes, not their contents.

For example, here is the code I use to completely clean out my repository 
starting at the root.  Unfortunately, it has the nasty side effect of 
creating temporary files for all of the blobs in the repository before 
the nodes are deleted:

    Node root = repoSession.getRootNode();
    Node jackrabbitRoot = root.getNode(JackrabbitResourceRepository.REPOSITORY_ROOT_NAME);
    for (Iterator childItr = jackrabbitRoot.getNodes(); childItr.hasNext();) {
        Node nodeToDelete = (Node) childItr.next();
        nodeToDelete.remove();
    }

 Is there any way to avoid downloading the blob data unnecessarily in 
situations like the one above?  I really only want to download the blob if 
a user asks for it.  Instead, it seems the blob is always getting 
downloaded so that Jackrabbit can create the BLOBFileValue for each blob 
in the DB.
Thanks,
Joe.

Re: Unnecessary downloading of blob data

Posted by Stefan Guggisberg <st...@gmail.com>.
On 1/8/07, jdente@21technologies.com <jd...@21technologies.com> wrote:
> Hi,
> I recently extended the DatabasePersistenceManager to create a
> PostgresqlPersistenceManager that takes advantage of PostgreSQL's
> LargeObject API.  This was done primarily by extending DbBLOBStore and
> overriding its methods so that they store an oid instead of a bytea
> column for BINVAL_DATA, and then using the LargeObject API to actually
> store the binary data based on that oid.  My new PostgreSQL persistence
> manager successfully removed the memory problems associated with using
> the bytea data type to store large blobs (I can now store a file of any
> size with the default VM heap settings).
>
> Although this fixed the memory problems I was working on earlier, I
> noticed that performance still wasn't as fast as I would have hoped.
> Stepping through the code, I realized that every time Serializer.serialize
> or Serializer.deserialize is called for a PropertyState, any binary data
> stored in a blob gets downloaded to a temporary file as the InternalValues
> are assembled.  For a large file, this may take 30 seconds.  As a result,
> simple operations like listing child nodes cause the entire blob to be
> downloaded from the PostgreSQL database and written to a temporary file,
> which seems unnecessary since in most of these situations all I really
> care about are the names of the nodes, not their contents.

note that just enumerating/reading nodes (e.g. using Node.getNodes()) should
*not* cause any binary property value to be loaded.

e.g. let's assume a node /a with child nodes b1, b2, ... bn each having
binary properties. the following code should not trigger any binary data
access:

NodeIterator iter = root.getNode("a").getNodes();
while (iter.hasNext()) {
    Node child = iter.nextNode();
    System.out.println(child.getName());
}

cheers
stefan

> For example, here is the code I use to completely clean out my repository
> starting at the root.  Unfortunately, it has the nasty side effect of
> creating temporary files for all of the blobs in the repository before
> the nodes are deleted:
>
>     Node root = repoSession.getRootNode();
>     Node jackrabbitRoot = root.getNode(JackrabbitResourceRepository.REPOSITORY_ROOT_NAME);
>     for (Iterator childItr = jackrabbitRoot.getNodes(); childItr.hasNext();) {
>         Node nodeToDelete = (Node) childItr.next();
>         nodeToDelete.remove();
>     }
>
>  Is there any way to avoid downloading the blob data unnecessarily in
> situations like the one above?  I really only want to download the blob if
> a user asks for it.  Instead, it seems the blob is always getting
> downloaded so that Jackrabbit can create the BLOBFileValue for each blob
> in the DB.
> Thanks,
> Joe.
>

Re: Unnecessary downloading of blob data

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

[Probably more appropriate for dev@]

On 1/8/07, jdente@21technologies.com <jd...@21technologies.com> wrote:
> Is there any way to avoid downloading the blob data unnecessarily in
> situations like the one above?  I really only want to download the blob if
> a user asks for it.  Instead, it seems the blob is always getting
> downloaded so that Jackrabbit can create the BLOBFileValue for each blob
> in the DB.

Unfortunately, at the moment only file-system-based blob stores avoid this
copying. The best way to improve the situation would probably be to make
BLOBFileValue keep the blob identifier and a reference to the BLOBStore
instead of a FileSystemResource reference. There are probably some
complications to work around... You may want to file an improvement
request in Jira about this.
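
For illustration, such a lazy value might look roughly like the sketch
below. The names (LazyBlobValue, BlobStreamSource) are hypothetical rather
than the actual Jackrabbit classes; the idea is just that the value holds
only the blob identifier and a handle to the store, so nothing is read from
the database until getStream() is called:

    import java.io.IOException;
    import java.io.InputStream;

    /** Hypothetical minimal contract; Jackrabbit's BLOBStore interface
        provides roughly the equivalent get-by-id method. */
    interface BlobStreamSource {
        InputStream get(String blobId) throws Exception;
    }

    /** Hypothetical lazily-loaded binary value, not the real BLOBFileValue. */
    public class LazyBlobValue {

        private final BlobStreamSource store;
        private final String blobId;

        public LazyBlobValue(BlobStreamSource store, String blobId) {
            this.store = store;
            this.blobId = blobId;
        }

        /** The blob is only read from the store when the stream is requested. */
        public InputStream getStream() throws IOException {
            try {
                return store.get(blobId);
            } catch (Exception e) {
                throw new IOException("failed to read blob " + blobId + ": " + e);
            }
        }

        public String getBlobId() {
            return blobId;
        }
    }

With something along these lines, deserializing a PropertyState would only
need the stored identifier, and the temporary-file copy would happen only
when a caller actually asks for the binary content.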

BR,

Jukka Zitting