Posted to users@jackrabbit.apache.org by jd...@21technologies.com on 2007/01/08 21:49:09 UTC
Unnecessary downloading of blob data
Hi,
I recently extended DatabasePersistenceManager to create a
PostgresqlPersistenceManager that takes advantage of PostgreSQL's
LargeObject API. This was done primarily by extending DbBLOBStore and
overriding its methods so that they store an oid instead of a bytea
column for BINVAL_DATA, then using the LargeObject API to store the
binary data under that oid. The new persistence manager removed the
memory problems associated with using the bytea data type to store
large blobs (I can now store a file of any size with the default VM
heap settings).

Although this fixed the memory problems I was working on earlier, I
noticed that performance still wasn't as fast as I had hoped. Stepping
through the code, I realized that every time Serializer.serialize or
Serializer.deserialize is called for a PropertyState, any binary data
stored in a blob is downloaded to a temporary file while the
InternalValues are assembled. For a large file this can take 30
seconds. As a result, simple operations like listing child nodes cause
the entire blob to be downloaded from the PostgreSQL database and
written to a temporary file, which seems unnecessary since in most of
these situations all I really care about are the names of the nodes,
not their contents.

For example, here is the code I use to completely clean out my
repository starting at the root. Unfortunately, this has the nasty
side effect of creating temporary files for all of the blobs in the
repository before the nodes are deleted:
Node root = repoSession.getRootNode();
Node jackrabbitRoot =
        root.getNode(JackrabbitResourceRepository.REPOSITORY_ROOT_NAME);
for (Iterator childItr = jackrabbitRoot.getNodes(); childItr.hasNext();) {
    Node nodeToDelete = (Node) childItr.next();
    nodeToDelete.remove();
}
Is there any way to avoid downloading the blob data unnecessarily in
situations like the one above? I really only want to download the blob if
a user asks for it. Instead, it seems the blob is always getting
downloaded so that Jackrabbit can create the BLOBFileValue for each blob
in the DB.
Thanks,
Joe.
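[For context on why switching from bytea to large objects helps with
heap usage: a bytea value is materialized by the JDBC driver as a
single byte[], while a large object can be read and written through a
fixed-size buffer. Below is a minimal, database-free sketch of that
streaming pattern; the class name and buffer size are illustrative,
not Jackrabbit code.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopyDemo {
    // Copy in fixed-size chunks so peak memory stays at the buffer size,
    // regardless of how large the value is. This is the pattern the
    // LargeObject API enables, in contrast to loading a whole bytea value
    // into memory at once.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // arbitrary chunk size
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[100_000]; // stands in for a large blob
        InputStream in = new ByteArrayInputStream(payload);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(in, out);
        System.out.println("copied " + copied + " bytes"); // copied 100000 bytes
    }
}
```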
Re: Unnecessary downloading of blob data
Posted by Stefan Guggisberg <st...@gmail.com>.
On 1/8/07, jdente@21technologies.com <jd...@21technologies.com> wrote:
> Hi,
> I recently extended DatabasePersistenceManager to create a
> PostgresqlPersistenceManager that takes advantage of PostgreSQL's
> LargeObject API. This was done primarily by extending DbBLOBStore and
> overriding its methods so that they store an oid instead of a bytea
> column for BINVAL_DATA, then using the LargeObject API to store the
> binary data under that oid. The new persistence manager removed the
> memory problems associated with using the bytea data type to store
> large blobs (I can now store a file of any size with the default VM
> heap settings). Although this fixed the memory problems I was working
> on earlier, I noticed that performance still wasn't as fast as I had
> hoped. Stepping through the code, I realized that every time
> Serializer.serialize or Serializer.deserialize is called for a
> PropertyState, any binary data stored in a blob is downloaded to a
> temporary file while the InternalValues are assembled. For a large
> file this can take 30 seconds. As a result, simple operations like
> listing child nodes cause the entire blob to be downloaded from the
> PostgreSQL database and written to a temporary file, which seems
> unnecessary since in most of these situations all I really care about
> are the names of the nodes, not their contents.
Note that just enumerating/reading nodes (e.g. using Node.getNodes())
should *not* cause any binary property value to be loaded.

For example, assume a node /a with child nodes b1, b2, ... bn, each
having binary properties. The following code should not trigger any
binary data access:

NodeIterator iter = root.getNode("a").getNodes();
while (iter.hasNext()) {
    Node child = iter.nextNode();
    System.out.println(child.getName());
}
cheers
stefan
> For example, here is the code I use to completely clean out my
> repository starting at the root. Unfortunately, this has the nasty side
> effect of creating temporary files for all of the blobs in the repository
> before the nodes are deleted:
>
> Node root = repoSession.getRootNode();
> Node jackrabbitRoot =
> root.getNode(JackrabbitResourceRepository.REPOSITORY_ROOT_NAME);
> for (Iterator childItr = jackrabbitRoot.getNodes(); childItr.hasNext();) {
>     Node nodeToDelete = (Node) childItr.next();
>     nodeToDelete.remove();
> }
>
> Is there any way to avoid downloading the blob data unnecessarily in
> situations like the one above? I really only want to download the blob if
> a user asks for it. Instead, it seems the blob is always getting
> downloaded so that Jackrabbit can create the BLOBFileValue for each blob
> in the DB.
> Thanks,
> Joe.
>
Re: Unnecessary downloading of blob data
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
[Probably more appropriate for dev@]
On 1/8/07, jdente@21technologies.com <jd...@21technologies.com> wrote:
> Is there any way to avoid downloading the blob data unnecessarily in
> situations like the one above? I really only want to download the blob if
> a user asks for it. Instead, it seems the blob is always getting
> downloaded so that Jackrabbit can create the BLOBFileValue for each blob
> in the DB.
Unfortunately, currently only file-system-based blob stores avoid this
copying. The best way to improve the situation would probably be to
make BLOBFileValue keep the blob identifier and a reference to the
BLOBStore instead of a FileSystemResource reference. There are
probably some complications to work around... You may want to file an
improvement request about this in Jira.
BR,
Jukka Zitting
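[Jukka's suggestion, keeping only the blob identifier plus a reference
to the store and fetching bytes on demand, can be sketched with
stand-in types. BlobStore, LazyBinaryValue, and LazyBlobDemo below are
hypothetical names for illustration, not Jackrabbit's actual API.]

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a blob store; not Jackrabbit's real BLOBStore.
interface BlobStore {
    InputStream get(String blobId);
}

// Sketch of a lazy binary value: it holds only the identifier and a store
// reference, and touches the store only when the stream is requested.
class LazyBinaryValue {
    private final BlobStore store;
    private final String blobId;

    LazyBinaryValue(BlobStore store, String blobId) {
        this.store = store;       // no data is fetched here
        this.blobId = blobId;
    }

    InputStream getStream() {
        return store.get(blobId); // the only point where blob data is read
    }
}

public class LazyBlobDemo {
    public static void main(String[] args) throws Exception {
        Map<String, byte[]> data = new HashMap<>();
        data.put("blob-1", "hello".getBytes("UTF-8"));

        // Count how often the store is actually hit.
        final int[] fetches = {0};
        BlobStore store = blobId -> {
            fetches[0]++;
            return new ByteArrayInputStream(data.get(blobId));
        };

        // Assembling the value (as a deserializer would) costs nothing:
        LazyBinaryValue value = new LazyBinaryValue(store, "blob-1");
        System.out.println("fetches after construction: " + fetches[0]); // 0

        // Data is pulled from the store only when a caller asks for it:
        int first = value.getStream().read();
        System.out.println("fetches after read: " + fetches[0]);         // 1
        System.out.println("first byte: " + (char) first);               // h
    }
}
```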