Posted to users@jackrabbit.apache.org by chewy_fruit_loop <ch...@yahoo.com> on 2007/09/12 17:22:09 UTC

persistance

I'm currently trying to import an XML file into a bog-standard empty
repository.
The problem is that the file is 72.5MB and contains around 200,000 elements
(yes, they are all required).  It's currently taking about 90 minutes (give or
take) to get into Derby, and that's with indexing off.

The time wouldn't be such an issue if it didn't use 1.7GB of RAM.
I've decorated a ContentHandler so that it calls:

root.update(<workspace name>)
root.save()

where root is the root node of the tree.
This is called after every 500 start elements, but the save just doesn't
seem to flush the parsed content to the persistent store.
The behaviour is the same whether I use Derby or Oracle as storage.  The only
time things seem to start being persisted is when endDocument is hit.

Have I missed something blindingly obvious here?  I really don't mind
everyone having a bit of a chuckle at me, I just want to get this sorted
out.
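The batch-saving decorator described above can be sketched in plain SAX terms. This is a minimal, hypothetical sketch: BatchingHandler and the flush callback are illustrative stand-ins for the poster's HandlerWrapper and the root.save() call, and no JCR repository is involved.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;

// Hypothetical decorator: delegates SAX events to the real import handler
// and triggers a flush callback every batchSize start elements.  In the
// real importer the callback would persist the batch (e.g. a save call).
public class BatchingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final int batchSize;
    private final Runnable flush;
    private int elements = 0;
    public int flushes = 0;

    public BatchingHandler(ContentHandler delegate, int batchSize, Runnable flush) {
        this.delegate = delegate;
        this.batchSize = batchSize;
        this.flush = flush;
    }

    private void doFlush() {
        flush.run();          // persist the current batch in the real importer
        flushes++;
    }

    @Override
    public void startDocument() throws SAXException {
        delegate.startDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        delegate.startElement(uri, local, qName, atts);
        if (++elements % batchSize == 0) {
            doFlush();        // batch boundary reached
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        delegate.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        delegate.characters(ch, start, len);
    }

    @Override
    public void endDocument() throws SAXException {
        delegate.endDocument();
        if (elements % batchSize != 0) {
            doFlush();        // persist the final partial batch
        }
    }

    // Demonstration helper: parse xml and report how many flushes occurred.
    public static int countFlushes(String xml, int batchSize) throws Exception {
        BatchingHandler h = new BatchingHandler(new DefaultHandler(), batchSize, () -> {});
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        return h.flushes;
    }
}
```

In the fix the thread eventually arrives at, the delegate would be the handler returned by Session.getImportContentHandler and the flush callback would invoke the session's save.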


thanks

-- 
View this message in context: http://www.nabble.com/persistance-tf4430069.html#a12637949
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.


Re: persistance

Posted by chewy_fruit_loop <ch...@yahoo.com>.
Hmmm... that's what I'm doing.

here's the code that kicks off the import:

  // obtain a workspace-level import handler and wrap it in my decorator
  ContentHandler cHandler = session.getWorkspace().getImportContentHandler(
      "/", ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);
  HandlerWrapper handler = new HandlerWrapper(cHandler, session);

  XMLReader reader = XMLReaderFactory.createXMLReader();
  reader.setContentHandler(handler);

  WorkspaceImpl wsp = (WorkspaceImpl) session.getWorkspace();
  if (!exists)
      wsp.createWorkspace(ROOT_NODE);

  session.getRepository().login(login, ROOT_NODE);

  reader.parse(new org.xml.sax.InputSource(getXMLStream()));


HandlerWrapper is my extension of ContentHandler, in a very similar vein to
this post:
http://www.nabble.com/Session.importXml---how-to-monitor-progress-t4349372.html

I know that in theory this should work...
Once the repository has been created on disk (about 6.5MB), its size
will not increase until the endDocument method has been called.  With a
typical XML document you'd probably never notice this, but the documents
going into the importer will range from hundreds of kilobytes
to 70+ MB in size.

I had a notion that this could be Derby holding on to the transactions
until a commit was issued, so I switched to the Oracle persistence manager
and got exactly the same result, only much slower (read: significantly).

Is there a way to set the persistence manager to write as it goes?

I've had to set the JVM to a maximum heap of 1.5GB just to get to the end of
the document, and I've also had to turn off Lucene, as it was overrunning the
heap and killing the program (which incidentally means I now have to work out
how to generate an index for the repository after the import has finished, as
there's not enough memory available on a win32 system to allocate 2GB to the
JVM, but that's another story).

I really want this to be a "doh" moment, but I'm getting an uneasy feeling
that it's not...




-- 
View this message in context: http://www.nabble.com/persistance-tf4430069.html#a12671085
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.


Re: persistance

Posted by chewy_fruit_loop <ch...@yahoo.com>.
Stefan, you're spot on there.
Thanks a million.  I've only had to remove a
  root.update(<workspaceName>)
from my save code and that worked.

and the code for getting a ContentHandler now looks like:

  ContentHandler cHandler = session.getImportContentHandler(
      "/", ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);

where it used to be:

  ContentHandler cHandler = session.getWorkspace().getImportContentHandler(
      "/", ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);


Hopefully the next unfortunate soul to come across this little doozy will
find this post and save themselves a bunch of grief.


Thanks all :)



-- 
View this message in context: http://www.nabble.com/persistance-tf4430069.html#a12671851
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.


Re: persistance

Posted by Stefan Guggisberg <st...@gmail.com>.
On 9/14/07, Florent Guillaume <fg...@nuxeo.com> wrote:
> If you import that big a file, you should import directly into the
> workspace and not in the session, without going through the transient
> space and using lots of memory.
> So use Workspace.getImportContentHandler or Workspace.importXML, not the
> Session methods. Read the JSR-170 for the benefits.

that's absolutely correct, theoretically ;-) the workspace methods avoid the
transient layer. however, in the current implementation of jackrabbit
the workspace import methods are still memory-bound because the
entire change log is kept in memory until commit on endDocument.

for very large imports i'd therefore suggest using the session
import methods, saving batches of e.g. 1000 items with a
ContentHandler decorator.

hope this helps.

cheers
stefan


Re: persistance

Posted by Florent Guillaume <fg...@nuxeo.com>.
If you import that big a file, you should import directly into the 
workspace and not in the session, without going through the transient 
space and using lots of memory.
So use Workspace.getImportContentHandler or Workspace.importXML, not the 
Session methods. Read the JSR-170 for the benefits.

Florent



-- 
Florent Guillaume, Director of R&D, Nuxeo
Open Source Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87