Posted to users@jackrabbit.apache.org by da...@butterdev.com on 2015/11/13 22:33:01 UTC

Node Retrieval Performance

Hi,
I am new to Jackrabbit and am using version 2.11.2.  I am using Jackrabbit
to store documents in a multi-threaded environment.  I noticed that the
time it takes to retrieve the root node is inconsistent and slow
(several seconds or more) and degrades over time (with more than 50K child
nodes, retrieval takes ~15 seconds).

Originally, I was using code as follows to obtain a repository:

  public Repository getRepository() throws ClassNotFoundException, RepositoryException {
      ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
      return JcrUtils.getRepository(jackabbitServerUrl);
  }

Then I came across the following thread:
http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302

This thread had some useful information (BatchReadConfig), but I am not
certain how to use the API to take advantage of it.  I have changed my
code to the following, but node retrieval performance does not appear to
have improved.  Is there something I am missing or doing wrong?

1) Repository Factory
public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters) throws RepositoryException {
    String repositoryFactoryName = parameters != null && (
            parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
            parameters.containsKey(PARAM_REPOSITORY_CONFIG))
            ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
            : "org.apache.jackrabbit.core.RepositoryFactoryImpl";

    Object repositoryFactory;
    try {
        Class<?> repositoryFactoryClass = Class.forName(repositoryFactoryName, true,
                Thread.currentThread().getContextClassLoader());

        repositoryFactory = repositoryFactoryClass.newInstance();
    }
    catch (Exception e) {
        throw new RepositoryException(e);
    }

    if (repositoryFactory instanceof RepositoryFactory) {
        return ((RepositoryFactory) repositoryFactory).getRepository(parameters);
    }
    else {
        throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
    }
}

2) Use the factory to get a repo:
public Repository getRepository() throws ClassNotFoundException, RepositoryException {
    Map<String, RepositoryConfig> parameters = Collections.singletonMap(
            "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
            (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));

    return getRepository(parameters);
}

3) Repository Config:
private static final class RepositoryConfigImpl implements RepositoryConfig {

    private String jackabbitServerUrl;

    private RepositoryConfigImpl(String jackabbitServerUrl) {
        super();
        this.jackabbitServerUrl = jackabbitServerUrl;
    }

    public CacheBehaviour getCacheBehaviour() {
        return CacheBehaviour.INVALIDATE;
    }

    public int getItemCacheSize() {
        return 100;
    }

    public int getPollTimeout() {
        return 5000;
    }

    public RepositoryService getRepositoryService() throws RepositoryException {
        BatchReadConfig brc = new BatchReadConfig() {
            public int getDepth(Path path, PathResolver resolver) throws NamespaceException {
                return 1;
            }
        };
        return new RepositoryServiceImpl(jackabbitServerUrl, brc);
    }

}

Thanks for your time.

David
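
Since the report above is that retrieval is both slow and inconsistent, it may help to record per-call timings before and after any configuration change. The following is a minimal sketch using only the JDK. CallTimer and timeMillis are hypothetical names made up for this example, and the usage mentioned in the comment assumes a javax.jcr.Session variable named session that is not defined here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

final class CallTimer {
    private CallTimer() {}

    /** Runs the call once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Callable<?> call) throws Exception {
        long start = System.nanoTime();
        call.call();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload. In the application this would be something like
        // () -> session.getRootNode()  (assumed variable, not defined here).
        long elapsed = timeMillis(() -> {
            Thread.sleep(20);
            return null;
        });
        System.out.println("call took " + elapsed + " ms");
    }
}
```

Logging these values over a run would show whether the slowdown grows with the number of child nodes or appears in bursts.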




Re: Node Retrieval Performance

Posted by Robert Munteanu <ro...@apache.org>.
On Sat, Nov 14, 2015 at 8:28 PM, Clay Ferguson <wc...@gmail.com> wrote:
> I would argue
> that most of them would fix this if they ever started over from scratch

Which, I guess, explains my other mail about Oak having a much
better performance profile with child nodes that are not orderable.

Thanks,

Robert

Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
Dirk,
There are probably only a handful of core JCR (Jackrabbit) developers who
fully understand why this technical limitation exists, and I would argue
that most of them would fix this if they ever started over from scratch.
Just imagine how much harder it is now for someone to convert an existing
RDBMS over to JCR. In such a conversion, the obvious thing to do is let
each RDB table become a parent node and contain all the children from the
table of the same "type". But even this most basic architectural pattern
fails at 50K records? Uh, that's a fail. If this is an impossible problem
to solve for some kind of technical reason, that's one thing, but when
people try to explain it away as a "feature" rather than a "bug", I just
don't buy a word of it.
-Clay


On Sat, Nov 14, 2015 at 11:22 AM, Dirk Rudolph <di...@netcentric.biz>
wrote:

> The whole discussion about large number of siblings is kind of off topic.
>
> I would answer: as everything performs better when it's optimized and the
> common use case of jackrabbit is to store hierarchical data, choose another
> data store if you want to store flat data. Same for RDBMS. They are not
> designed for hierarchical data and implementing this use case has a
> drawback there as well as implementing large flat data for jackrabbit.
>
> The question and my last off topic comment to the discussion is: is it a)
> worth the effort and b) do we really want the drawbacks?
>
> At the end the original topic should perform well with some hierarchy. If
> not jackrabbit may not be the best to store those kind of data.
>
> Cheers, D
>
>

Re: Node Retrieval Performance

Posted by Dirk Rudolph <di...@netcentric.biz>.
The whole discussion about large number of siblings is kind of off topic.

I would answer: everything performs better when it is used for what it is
optimized for, and the common use case of Jackrabbit is storing hierarchical
data. Choose another data store if you want to store flat data. The same
goes for an RDBMS: it is not designed for hierarchical data, so implementing
that use case there has a drawback, just as storing large flat data has in
Jackrabbit.

The question, and my last off-topic comment in this discussion, is: a) is it
worth the effort, and b) do we really want the drawbacks?

In the end, the original use case should perform well with some hierarchy. If
not, Jackrabbit may not be the best place to store that kind of data.

Cheers, D
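
One way to read the "some hierarchy" advice is to derive a short bucket prefix from a hash of each content id, so that no single node ends up with tens of thousands of direct children. The sketch below is a minimal illustration using only the JDK. The class name BucketedPath, the method forContentId, and the choice of SHA-256 with one hashed byte (two hex characters) per level are assumptions made for this example; they are not part of Jackrabbit or of any code posted in this thread. It also assumes the content id is already usable as a JCR node name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class BucketedPath {
    private BucketedPath() {}

    /**
     * Returns a relative path of the form "xx/yy/<contentId>", where each
     * two-character segment is one byte of the SHA-256 hash of the id in hex.
     */
    static String forContentId(String contentId, int levels) {
        if (contentId == null || contentId.isEmpty()) {
            throw new IllegalArgumentException("contentId must not be empty");
        }
        if (levels < 1 || levels > 8) {
            throw new IllegalArgumentException("levels must be between 1 and 8");
        }
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256")
                    .digest(contentId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            int b = digest[i] & 0xff;
            path.append(Character.forDigit(b >>> 4, 16))
                .append(Character.forDigit(b & 0x0f, 16))
                .append('/');
        }
        return path.append(contentId).toString();
    }

    public static void main(String[] args) {
        // Prints a path with two hashed bucket levels followed by the id.
        System.out.println(forContentId("doc-1", 2));
    }
}
```

With two levels there are 256 * 256 = 65,536 buckets, so 25 million ids would average roughly 380 nodes per bucket. Because the path is computed from the id alone, a consumer that receives only the content id can recompute the same path without a lookup.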



-- 

Dirk Rudolph | Senior Software Engineer

Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

dirk.rudolph@netcentric.biz | www.netcentric.biz

Re: Node Retrieval Performance

Posted by Ron Wheeler <rw...@artifact-software.com>.
Only if you don't care about performance.

You may have to break up tables or physically cluster tables together to 
get the performance that you want based on the pattern of access and the 
actual columns that will be requested together.

It is hard for a database system to know how many records you are going 
to add when you declare the first node and what the pattern of access 
will be when you store the first record.

Ron

On 14/11/2015 1:37 PM, Clay Ferguson wrote:
> Sorry Ron,
> You have it precisely backwards. In RDBMS modeling you focus on the
> organization of the data and relationships of the data, and never break
> stuff up to "help" the DB loading. For example, once you see the need for a
> PERSONS table, you generally have just *one* PERSONS table even if you have
> millions of people. In an RDBMS you never find yourself searching for
> patterns in the data just so you can break stuff up to help provide a
> crutch for the DB engine. DB indexes are all that are required to solve
> scalability. NEVER breaking up data.
> -Clay
>
>
> On Sat, Nov 14, 2015 at 11:43 AM, Ron Wheeler <
> rwheeler@artifact-software.com> wrote:
>
>> Even in an RDBMS application, the database designer has to be aware of the
>> physical structure used by the implementation if you want to have
>> reasonable performance with large numbers of records.
>>
>> You have to do tuning and give some thought to the way you structure your
>> tables and indexes.
>> That may mean splitting tables in ways that have little to do with the
>> business logic in order to get the best performance in the most common or
>> critical cases.
>>
>> That is why we have database administrators and courses at the university
>> level on data structures.
>>
>> Ron
>>
>>
>>


-- 
Ron Wheeler
President
Artifact Software Inc
email: rwheeler@artifact-software.com
skype: ronaldmwheeler
phone: 866-970-2435, ext 102


Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
Sorry Ron,
You have it precisely backwards. In RDBMS modeling you focus on the
organization of the data and relationships of the data, and never break
stuff up to "help" the DB loading. For example, once you see the need for a
PERSONS table, you generally have just *one* PERSONS table even if you have
millions of people. In an RDBMS you never find yourself searching for
patterns in the data just so you can break stuff up to help provide a
crutch for the DB engine. DB indexes are all that are required to solve
scalability. NEVER breaking up data.
-Clay


On Sat, Nov 14, 2015 at 11:43 AM, Ron Wheeler <
rwheeler@artifact-software.com> wrote:

> Even in an RDBMS application, the database designer has to be aware of the
> physical structure used by the implementation if you want to have
> reasonable performance with large numbers of records.
>
> You have to do tuning and give some thought to the way you structure your
> tables and indexes.
> That may mean splitting tables in ways that have little to do with the
> business logic in order to get the best performance in the most common or
> critical cases.
>
> That is why we have database administrators and courses at the university
> level on data structures.
>
> Ron
>
>
>

Re: Node Retrieval Performance

Posted by Ron Wheeler <rw...@artifact-software.com>.
Even in an RDBMS application, the database designer has to be aware of 
the physical structure used by the implementation if you want to have 
reasonable performance with large numbers of records.

You have to do tuning and give some thought to the way you structure 
your tables and indexes.
That may mean splitting tables in ways that have little to do with the 
business logic in order to get the best performance in the most common 
or critical cases.

That is why we have database administrators and courses at the 
university level on data structures.

Ron


On 14/11/2015 11:56 AM, Clay Ferguson wrote:
> Dirk,
> You are not adding new information. Everything you just said was a known
> and a given. We all realize we can be creative and solve this problem, and
> avoid large numbers of children in all manner of creative and
> straightforward ways. However, can you imagine yourself making the same
> statement about RDBMS tables? If you were a developer on a RDBMS,
> struggling to get scale working, would you ever say this to your boss: "Oh
> well, if the table gets over 50K, we can just add new tables, because since
> the DB can't deal with it we can just put the responsibility on the app
> developers." If that would be a silly statement in the RDBMS world, it will
> be silly in the NoSQL world for all the same exact reasons.
>
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Sat, Nov 14, 2015 at 10:07 AM, Dirk Rudolph <di...@netcentric.biz>
> wrote:
>
>> Each of the records has a primary key, I guess. So build the UUID or any
>> hash from it and use it as the key in a BTree structure. Simple and
>> straightforward.
>>
>> Actually the idea is to find structure in your data. This is a core idea of
>> structured document stores. In case you have a large amount of siblings the
>> detail level of your structure might not be deep enough.
>>
>> Anyway if you want to store key value tables somewhere there is a broad
>> pool of available open source solutions.
>>
>> Cheers, D
>>
>> On Saturday, 14 November 2015, Clay Ferguson <wc...@gmail.com> wrote:
>>
>>> Dirk,
>>> What you're explaining would work great if the data had naturally occurring
>>> categories all being conveniently at whatever size JCR happens to handle
>>> ok. This just doesn't work well in actuality. What if I just need to store
>>> a table of 25 million arbitrary records? The "it can't be done" with JCR is
>>> the honest answer. Solving it by creating a bunch of separate buckets is a
>>> massive ugly kluge. Whatever the technical limitation is, it's INSIDE
>>> Jackrabbit, and badly needs to be addressed rather than forcing developers
>>> to jump thru hoops in application code. Surely I can't be the only one to
>>> think this? Is everybody else just afraid to be critical like me, because
>>> they are getting paid to work on JCR? Why don't we just be honest.
>>>
>>> Best regards,
>>> Clay Ferguson
>>> wclayf@gmail.com
>>>
>>>
>>> On Sat, Nov 14, 2015 at 2:35 AM, Dirk Rudolph <dirk.rudolph@netcentric.biz>
>>> wrote:
>>>
>>>>> I am planning on storing a lot of data in JackRabbit (terabytes)
>>>> But that should not mean storing them all as children of a single Node.
>>>> Probably you should think about driving the hierarchy as explained in
>>>> DavidsModel.
>>>>
>>>> So in general you would structure your files in for example categories:
>>>>
>>>> /categoryA
>>>> /categoryB
>>>> /categoryC
>>>>
>>>> Or even
>>>>
>>>> /categoryA/sub1/subsuba
>>>> /categoryA/sub1/subsubb
>>>>
>>>> and so on. Each of them could then be the root of a NodeSequence managed as
>>>> a BTree. This would additionally allow you to split the content over multiple
>>>> Jackrabbit instances to increase performance.
>>>>
>>>> In general Jackrabbit is/should be able to handle that much data, but
>>>> maintenance might take a lot of time, blocking your application. So you
>>>> should try to keep the repository size of a single instance as small as
>>>> possible by, for example, splitting content by category, region of access, or
>>>> whatever.
>>>>
>>>>> Or can I simplify it and just do something like this to get a repo
>>>>
>>>> Have a look at:
>>>>
>>>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)
>>>>
>>>> The parameterMap contains for example
>>>>
>>>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
>>>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
>>>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE
>>>>
>>>> Btw. It should not be required to call ServiceLoader#load() by yourself.
>>>>
>>>> Cheers, D
>>>>
>>>> Dirk Rudolph | Senior Software Engineer
>>>> Netcentric AG
>>>>
>>>> M: +41 79 642 37 11
>>>> D: +49 174 966 84 34
>>>>
>>>> dirk.rudolph@netcentric.biz | www.netcentric.biz
>>>>
>>>>> On 14 Nov 2015, at 01:26, David Marginian <david@butterdev.com> wrote:
>>>>>
>>>>> Thanks Dirk, I should have found that page on my own.  I am going to
>>>>> look into using the BTreeManager, just curious what are the limitations for
>>>>> documents/file counts within a node?  I am planning on storing a lot of
>>>>> data in JackRabbit (terabytes).  Also, is the configuration code I posted
>>>>> in my previous posts the best way to do things?  Or can I simplify it and
>>>>> just do something like this to get a repo:
>>>>>
>>>>> ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>>>>> return JcrUtils.getRepository(jackabbitServerUrl);
>>>>>
>>>>> On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
>>>>>> Did I understand you right: you have thousands of child nodes below the
>>>>>> root node?
>>>>>>
>>>>>> You should avoid this because it is considered bad practice in terms of
>>>>>> write performance, and depending on your concurrent access this might also
>>>>>> block read access.
>>>>>>
>>>>>> http://wiki.apache.org/jackrabbit/Performance
>>>>>>
>>>>>> Try to introduce a structure to your content using BTreeManager
>>>>>>
>>>>>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
>>>>>>
>>>>>> Cheers, D
>>>>>>
>>>>>>
>>>>>> On Friday, 13 November 2015, David Marginian <david@butterdev.com
>>> <javascript:;>>
>>>> wrote:
>>>>>>> Thanks Clay.  I am not trying to load that many records at once.
>> The
>>>>>>> application is crawling a directory.  It places the files from that
>>>>>>> directory into JackRabbit one at a time, and puts a content id
>> onto a
>>>> queue
>>>>>>> which is picked up by consumers on different servers.  Those
>>> consumers
>>>> then
>>>>>>> use the content id to retrieve the file from JackRabbit. Each piece
>>> of
>>>>>>> content is saved in a node under the root node.  The performance
>>>> slowdown
>>>>>>> is coming from calling session.getRootNode(), from what I can
>> gather
>>>> from
>>>>>>> the docs I need the root node in order to add a child node.  Note
>> the
>>>>>>> slowdown is pretty significant and I don't need to have close to
>> 50k
>>> to
>>>>>>> start seeing it (I start seeing it within a few minutes of running
>> my
>>>>>>> app).  I don't need orderable nodes, how do I disable that?
>>>>>>>
>>>>>>>
>>>>>>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
>>>>>>>
>>>>>>>> ​Please let us know more about your use case. Why are you even
>>>> "trying" to
>>>>>>>> load that many records all at once. Or at least scan them one by
>>> one,
>>>> I
>>>>>>>> mean. In most use cases you wouldn't need to do this kind of
>> thing,
>>>> unless
>>>>>>>> it's some kind of backup or replication. I say "most" cases... I'm
>>> not
>>>>>>>>    saying you don't need to just asking for a bit more background.
>>>> BTW: If
>>>>>>>> you don't need 'orderable' nodes try to avoid them. That type of
>>> node
>>>> does
>>>>>>>> not work at 'scale'... and 50K is propably pushing it.​
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Clay Ferguson
>>>>>>>> wclayf@gmail.com <javascript:;>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 13, 2015 at 3:33 PM, <david@butterdev.com
>>> <javascript:;>> wrote:
>>>>>>>> Hi,
>>>>>>>>> I am new to JackRabbit and using version 2.11.2.  I am using
>>>> JackRabbit
>>>>>>>>> to
>>>>>>>>> store documents in a multi-threaded environment.  I noticed that
>>> the
>>>> time
>>>>>>>>> it takes to retrieve the root node is inconsistent and slow
>>> (several
>>>>>>>>> seconds +) and degrades over time (after 50K plus child nodes
>>>> retrieval
>>>>>>>>> is
>>>>>>>>> taking ~15 seconds).
>>>>>>>>>
>>>>>>>>> Originally, I was using code as follows to obtain a repository:
>>>>>>>>>
>>>>>>>>>    public Repository getRepository() throws
>> ClassNotFoundException,
>>>>>>>>> RepositoryException {
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>>>>>>>>>        return JcrUtils.getRepository(jackabbitServerUrl);
>>>>>>>>>    }
>>>>>>>>>
>>>>>>>>> Then I came across the following thread:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
>>>>>>>>> This thread had some useful information (BatchReadConfig), but I
>> am
>>>> not
>>>>>>>>> certain how to use the API to take advantage of it.  I have
>> changed
>>>> my
>>>>>>>>> code
>>>>>>>>> to the following but it doesn't appear that node retrieval
>>>> performance
>>>>>>>>> has
>>>>>>>>> improved, is there something I am missing/doing wrong?
>>>>>>>>>
>>>>>>>>> 1) Repository Factory
>>>>>>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map
>>>>>>>>> parameters) throws RepositoryException {
>>>>>>>>>           String repositoryFactoryName = parameters != null && (
>>>>>>>>>
>>>>>>>>>   parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
>>>>>>>>>
>>>> parameters.containsKey(PARAM_REPOSITORY_CONFIG))
>>>>>>>>>                   ?
>>>>>>>>> "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
>>>>>>>>>                   :
>>>> "org.apache.jackrabbit.core.RepositoryFactoryImpl";
>>>>>>>>>           Object repositoryFactory;
>>>>>>>>>           try {
>>>>>>>>>               Class<?> repositoryFactoryClass =
>>>>>>>>> Class.forName(repositoryFactoryName, true,
>>>>>>>>>
>>> Thread.currentThread().getContextClassLoader());
>>>>>>>>>               repositoryFactory =
>>>> repositoryFactoryClass.newInstance();
>>>>>>>>>           }
>>>>>>>>>           catch (Exception e) {
>>>>>>>>>               throw new RepositoryException(e);
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           if (repositoryFactory instanceof RepositoryFactory) {
>>>>>>>>>               return ((RepositoryFactory)
>>>>>>>>> repositoryFactory).getRepository(parameters);
>>>>>>>>>           }
>>>>>>>>>           else {
>>>>>>>>>               throw new RepositoryException(repositoryFactory + "
>> is
>>>> not a
>>>>>>>>> RepositoryFactory");
>>>>>>>>>           }
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> 2) Use the factory to get a repo:
>>>>>>>>>    public Repository getRepository() throws
>> ClassNotFoundException,
>>>>>>>>> RepositoryException {
>>>>>>>>>           Map<String, RepositoryConfig> parameters =
>>>>>>>>> Collections.singletonMap(
>>>>>>>>>
>> "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
>>>>>>>>>                   (RepositoryConfig) new
>>>>>>>>> RepositoryConfigImpl(jackabbitServerUrl));
>>>>>>>>>
>>>>>>>>>           return getRepository(parameters);
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> 3) Repository Config:
>>>>>>>>> private static final class RepositoryConfigImpl implements
>>>>>>>>> RepositoryConfig {
>>>>>>>>>
>>>>>>>>>           private String jackabbitServerUrl;
>>>>>>>>>
>>>>>>>>>           private RepositoryConfigImpl(String jackabbitServerUrl)
>> {
>>>>>>>>>               super();
>>>>>>>>>               this.jackabbitServerUrl = jackabbitServerUrl;
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           public CacheBehaviour getCacheBehaviour() {
>>>>>>>>>               return CacheBehaviour.INVALIDATE;
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           public int getItemCacheSize() {
>>>>>>>>>               return 100;
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           public int getPollTimeout() {
>>>>>>>>>               return 5000;
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>           public RepositoryService getRepositoryService() throws
>>>>>>>>> RepositoryException {
>>>>>>>>>               BatchReadConfig brc = new BatchReadConfig() {
>>>>>>>>>                   public int getDepth(Path path, PathResolver
>>>> resolver)
>>>>>>>>> throws NamespaceException {
>>>>>>>>>                       return 1;
>>>>>>>>>                   }
>>>>>>>>>               };
>>>>>>>>>               return new RepositoryServiceImpl(jackabbitServerUrl,
>>>> brc);
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> Thanks for your time.
>>>>>>>>>
>>>>>>>>> David
>>>>
>>
>> --
>>
>> Dirk Rudolph | Senior Software Engineer
>>
>> Netcentric AG
>>
>> M: +41 79 642 37 11
>> D: +49 174 966 84 34
>>
>> dirk.rudolph@netcentric.biz | www.netcentric.biz
>>


-- 
Ron Wheeler
President
Artifact Software Inc
email: rwheeler@artifact-software.com
skype: ronaldmwheeler
phone: 866-970-2435, ext 102


Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
Dirk,
You are not adding new information. Everything you just said was already
known and a given. We all realize we can be creative and solve this problem,
and avoid large numbers of children in all manner of creative and
straightforward ways. However, can you imagine yourself making the same
statement about RDBMS tables? If you were a developer on an RDBMS,
struggling to get scale working, would you ever say this to your boss: "Oh
well, if the table gets over 50K, we can just add new tables; since the DB
can't deal with it, we can just put the responsibility on the app
developers." If that would be a silly statement in the RDBMS world, it will
be silly in the NoSQL world for all the same exact reasons.


Best regards,
Clay Ferguson
wclayf@gmail.com


On Sat, Nov 14, 2015 at 10:07 AM, Dirk Rudolph <di...@netcentric.biz>
wrote:

> Each of the records has a primary key, I guess. So build a UUID or any
> hash from it and use that as the key in a BTree structure. Simple and
> straightforward.
>
> Actually the idea is to find structure in your data. This is a core idea of
> structured document stores. If you have a large number of siblings, the
> level of detail in your structure might not be deep enough.
>
> Anyway, if you want to store key-value tables somewhere, there is a broad
> pool of available open source solutions.
>
> Cheers, D
>
> --
>
> Dirk Rudolph | Senior Software Engineer
>
> Netcentric AG
>
> M: +41 79 642 37 11
> D: +49 174 966 84 34
>
> dirk.rudolph@netcentric.biz | www.netcentric.biz
>

Re: Node Retrieval Performance

Posted by Dirk Rudolph <di...@netcentric.biz>.
Each of the records has a primary key, I guess. So build a UUID or any
hash from it and use that as the key in a BTree structure. Simple and
straightforward.
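
As a rough illustration of the hashing idea (this is plain hash bucketing, not
the BTreeManager API itself, which needs the Jackrabbit libraries): a minimal
sketch, assuming a hypothetical helper BucketPath.pathFor and two levels of
256 buckets taken from a SHA-256 hash of the content id. The class name, the
bucket layout and the sample id are assumptions for illustration only, and
the id is assumed to already be a valid JCR node name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Jackrabbit: derives a bucketed path from a content id.
final class BucketPath {

    /** Maps a content id to a path of the form /xx/yy/<id>, where xx and yy are hex hash bytes. */
    static String pathFor(String contentId) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256")
                    .digest(contentId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        // Two levels of 256 buckets each: the first two hash bytes, rendered as hex.
        String level1 = String.format("%02x", digest[0] & 0xff);
        String level2 = String.format("%02x", digest[1] & 0xff);
        return "/" + level1 + "/" + level2 + "/" + contentId;
    }

    public static void main(String[] args) {
        // Prints the bucketed path for a sample id.
        System.out.println(pathFor("doc-42"));
    }
}
```

Because the path is derived only from the id, a consumer that receives the id
from the queue can compute the same path again and fetch the node directly,
without listing the children of any parent.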

Actually the idea is to find structure in your data. This is a core idea of
structured document stores. If you have a large number of siblings, the
level of detail in your structure might not be deep enough.
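
To put a number on "deep enough", here is a back-of-the-envelope sketch. The
fan-out of 256 buckets per level and the limit of 1000 children per node are
assumptions picked for illustration, not Jackrabbit limits; the 50K and 25
million figures are the ones mentioned earlier in this thread.

```java
// Hypothetical helper for illustration: estimates how many bucket levels are needed.
final class TreeDepth {

    /**
     * Returns the smallest number of bucket levels such that, with items spread
     * evenly, no node has to hold more than maxPerNode items.
     */
    static int levelsNeeded(long items, int fanOut, int maxPerNode) {
        if (fanOut < 2 || maxPerNode < 1) {
            throw new IllegalArgumentException("fanOut must be >= 2 and maxPerNode >= 1");
        }
        int levels = 0;
        long capacity = maxPerNode; // what a single flat node can hold
        while (capacity < items) {
            capacity *= fanOut;     // each extra level multiplies capacity by the fan-out
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(levelsNeeded(50_000L, 256, 1000));     // prints 1
        System.out.println(levelsNeeded(25_000_000L, 256, 1000)); // prints 2
    }
}
```

Under these assumptions one level of buckets covers the 50K case and two
levels cover 25 million records; an even spread is of course an idealization.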

Anyway, if you want to store key-value tables somewhere, there is a broad
pool of available open source solutions.

Cheers, D

On Saturday, 14 November 2015, Clay Ferguson <wc...@gmail.com> wrote:

> Dirk,
> What you're explaining would work great if the data had naturally occurring
> categories all being conveniently at whatever size JCR happens to handle
> ok. This just doesn't work well in actuality. What if I just need to store
> a table of 25 million arbitrary records? The "it can't be done" with JCR is
> the honest answer. Solving it by creating a bunch of separate buckets is a
> massive ugly kluge. Whatever the technical limitation is, it's INSIDE
> Jackrabbit, and badly needs to be addressed rather than forcing developers
> to jump thru hoops in application code. Surely I can't be the only one to
> think this? Is everybody else just afraid to be critical like me, because
> they are getting paid to work on JCR? Why don't we just be honest.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com


-- 

Dirk Rudolph | Senior Software Engineer

Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

dirk.rudolph@netcentric.biz | www.netcentric.biz

Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
Dirk,
What you're explaining would work great if the data had naturally occurring
categories, each conveniently at whatever size JCR happens to handle well.
In practice it rarely works out that way. What if I just need to store
a table of 25 million arbitrary records? The honest answer is that it
can't be done with JCR. Solving it by creating a bunch of separate buckets
is a massive, ugly kludge. Whatever the technical limitation is, it's INSIDE
Jackrabbit, and it badly needs to be addressed rather than forcing developers
to jump through hoops in application code. Surely I can't be the only one who
thinks this? Is everybody else just afraid to be as critical as I am, because
they are getting paid to work on JCR? Why don't we just be honest?

Best regards,
Clay Ferguson
wclayf@gmail.com



Re: Node Retrieval Performance

Posted by Dirk Rudolph <di...@netcentric.biz>.
> I am planning on storing a lot of data in JackRabbit (terabytes)

But that should not mean storing it all as children of a single node. You should probably think about driving the content hierarchy as explained in DavidsModel.

So in general you would structure your files into, for example, categories:

/categoryA
/categoryB
/categoryC

Or even

/categoryA/sub1/subsuba
/categoryA/sub1/subsubb

and so on. Each of them could then be the root of a NodeSequence managed as a BTree. This would additionally allow you to split the content over multiple Jackrabbit instances to increase performance.

In general Jackrabbit is/should be able to handle that much data, but maintenance might take a lot of time and block your application. So you should try to keep the repository size of a single instance as small as possible by, for example, splitting content by category, region of access, or whatever else fits.
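
When the content has no natural categories, one possible way to still get such a hierarchy is to derive the intermediate levels from a hash of the content id. The following is only a minimal sketch under stated assumptions: the class BucketPath, the root path /content and the choice of SHA-1 are all hypothetical and not part of Jackrabbit. It only computes the path string; creating the nodes at that path is left to the application, and the content id is assumed to be a valid JCR node name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of Jackrabbit: derives a bucketed path from a content id. */
final class BucketPath {

    private BucketPath() {
    }

    /**
     * Returns root + "/xx/yy/.../" + contentId, where each "xx" is one byte of
     * the SHA-1 hash of the content id, written as two hex digits.
     */
    static String forContentId(String root, String contentId, int levels) {
        if (levels < 0 || levels > 20) {
            // SHA-1 produces 20 bytes, so at most 20 levels can be derived.
            throw new IllegalArgumentException("levels must be between 0 and 20");
        }
        byte[] hash = sha1(contentId);
        StringBuilder path = new StringBuilder(root);
        for (int i = 0; i < levels; i++) {
            // One hash byte per level: each bucket node has at most 256 bucket children.
            path.append('/').append(String.format("%02x", hash[i] & 0xff));
        }
        return path.append('/').append(contentId).toString();
    }

    private static byte[] sha1(String value) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints a path of the form /content/<xx>/<yy>/doc-42.
        System.out.println(forContentId("/content", "doc-42", 2));
    }
}
```

Because the path depends only on the content id, a consumer that receives the id from a queue could recompute the same path without listing any parent node. With two levels there are 65,536 leaf buckets, so the number of children per node stays small as long as the ids hash evenly.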

> Or can I simplify it and just do something like this to get a repo


Have a look at:

https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)

The parameter map can contain, for example:

https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE

Btw., you should not need to call ServiceLoader#load() yourself.
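
As a rough illustration of the parameter-map approach, the sketch below only builds the map. The class RepositoryParameters and the server URL are hypothetical, and the key string is assumed to be the value of JcrUtils.REPOSITORY_URI; please check it against the javadoc linked above. The actual JcrUtils.getRepository(Map) call is shown as a comment because it needs the Jackrabbit jars on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that prepares the parameter map for JcrUtils.getRepository(Map). */
final class RepositoryParameters {

    // Assumption: this is the value of the JcrUtils.REPOSITORY_URI constant.
    static final String REPOSITORY_URI = "org.apache.jackrabbit.repository.uri";

    private RepositoryParameters() {
    }

    static Map<String, String> forServer(String serverUrl) {
        if (serverUrl == null || serverUrl.isEmpty()) {
            throw new IllegalArgumentException("serverUrl must not be empty");
        }
        Map<String, String> parameters = new HashMap<>();
        parameters.put(REPOSITORY_URI, serverUrl);
        return parameters;
    }

    public static void main(String[] args) {
        // The URL is a placeholder for the real Jackrabbit server address.
        Map<String, String> parameters = forServer("http://localhost:8080/server");
        System.out.println(parameters);
        // With jackrabbit-jcr-commons on the classpath, the map would be passed on:
        // Repository repository = JcrUtils.getRepository(parameters);
    }
}
```

Further tuning keys, such as the spi2davex cache size constant linked above, would go into the same map.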

Cheers, D

Dirk Rudolph | Senior Software Engineer
Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

dirk.rudolph@netcentric.biz | www.netcentric.biz


Re: Node Retrieval Performance

Posted by David Marginian <da...@butterdev.com>.
Thanks Dirk, I should have found that page on my own.  I am going to 
look into using the BTreeManager.  Just curious: what are the limits on 
document/file counts within a node?  I am planning on storing a lot 
of data in JackRabbit (terabytes).  Also, is the configuration code I 
posted in my previous posts the best way to do things?  Or can I 
simplify it and just do something like this to get a repo:

ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory")); 

return JcrUtils.getRepository(jackabbitServerUrl);

On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
> Did I understand you right: you have thousands of child nodes below the
> root node?
>
> You should avoid this, because it is considered bad practice in terms of
> write performance and, depending on your concurrent access, it might also
> block read access.
>
> http://wiki.apache.org/jackrabbit/Performance
>
> Try to introduce a structure to your content using BTreeManager
>
>
> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
>
> Cheers, D
>
>
> On Friday, 13 November 2015, David Marginian <da...@butterdev.com> wrote:
>
>> Thanks Clay.  I am not trying to load that many records at once.  The
>> application is crawling a directory.  It places the files from that
>> directory into JackRabbit one at a time, and puts a content id onto a queue
>> which is picked up by consumers on different servers.  Those consumers then
>> use the content id to retrieve the file from JackRabbit.  Each piece of
>> content is saved in a node under the root node.  The performance slowdown
>> is coming from calling session.getRootNode().  From what I can gather from
>> the docs, I need the root node in order to add a child node.  Note that the
>> slowdown is pretty significant and I don't need to have close to 50k to
>> start seeing it (I start seeing it within a few minutes of running my
>> app).  I don't need orderable nodes; how do I disable that?
>>
>>
>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
>>
>>> Please let us know more about your use case. Why are you even "trying" to
>>> load that many records all at once? Or at least scan them one by one, I
>>> mean. In most use cases you wouldn't need to do this kind of thing, unless
>>> it's some kind of backup or replication. I say "most" cases... I'm not
>>> saying you don't need to, just asking for a bit more background. BTW: If
>>> you don't need 'orderable' nodes, try to avoid them. That type of node does
>>> not work at 'scale'... and 50K is probably pushing it.
>>>
>>> Best regards,
>>> Clay Ferguson
>>> wclayf@gmail.com
>>>
>>>
>>> On Fri, Nov 13, 2015 at 3:33 PM, <da...@butterdev.com> wrote:
>>>
>>>> Hi,
>>>> I am new to JackRabbit and using version 2.11.2.  I am using JackRabbit
>>>> to store documents in a multi-threaded environment.  I noticed that the
>>>> time it takes to retrieve the root node is inconsistent and slow (several
>>>> seconds +) and degrades over time (after 50K plus child nodes retrieval
>>>> is taking ~15 seconds).
>>>>
>>>> Originally, I was using code as follows to obtain a repository:
>>>>
>>>> public Repository getRepository() throws ClassNotFoundException,
>>>>         RepositoryException {
>>>>     ServiceLoader.load(Class.forName(
>>>>             "org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>>>>     return JcrUtils.getRepository(jackabbitServerUrl);
>>>> }
>>>>
>>>> Then I came across the following thread:
>>>>
>>>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
>>>>
>>>> This thread had some useful information (BatchReadConfig), but I am not
>>>> certain how to use the API to take advantage of it.  I have changed my
>>>> code to the following but it doesn't appear that node retrieval
>>>> performance has improved, is there something I am missing/doing wrong?
>>>>
>>>> 1) Repository Factory
>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters)
>>>>         throws RepositoryException {
>>>>     String repositoryFactoryName = parameters != null
>>>>             && (parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY)
>>>>                 || parameters.containsKey(PARAM_REPOSITORY_CONFIG))
>>>>             ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
>>>>             : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
>>>>
>>>>     Object repositoryFactory;
>>>>     try {
>>>>         Class<?> repositoryFactoryClass = Class.forName(
>>>>                 repositoryFactoryName, true,
>>>>                 Thread.currentThread().getContextClassLoader());
>>>>
>>>>         repositoryFactory = repositoryFactoryClass.newInstance();
>>>>     }
>>>>     catch (Exception e) {
>>>>         throw new RepositoryException(e);
>>>>     }
>>>>
>>>>     if (repositoryFactory instanceof RepositoryFactory) {
>>>>         return ((RepositoryFactory) repositoryFactory)
>>>>                 .getRepository(parameters);
>>>>     }
>>>>     else {
>>>>         throw new RepositoryException(
>>>>                 repositoryFactory + " is not a RepositoryFactory");
>>>>     }
>>>> }
>>>>
>>>> 2) Use the factory to get a repo:
>>>> public Repository getRepository() throws ClassNotFoundException,
>>>>         RepositoryException {
>>>>     Map<String, RepositoryConfig> parameters = Collections.singletonMap(
>>>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
>>>>             (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));
>>>>
>>>>     return getRepository(parameters);
>>>> }
>>>>
>>>> 3) Repository Config:
>>>> private static final class RepositoryConfigImpl
>>>>         implements RepositoryConfig {
>>>>
>>>>     private String jackabbitServerUrl;
>>>>
>>>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
>>>>         super();
>>>>         this.jackabbitServerUrl = jackabbitServerUrl;
>>>>     }
>>>>
>>>>     public CacheBehaviour getCacheBehaviour() {
>>>>         return CacheBehaviour.INVALIDATE;
>>>>     }
>>>>
>>>>     public int getItemCacheSize() {
>>>>         return 100;
>>>>     }
>>>>
>>>>     public int getPollTimeout() {
>>>>         return 5000;
>>>>     }
>>>>
>>>>     public RepositoryService getRepositoryService()
>>>>             throws RepositoryException {
>>>>         BatchReadConfig brc = new BatchReadConfig() {
>>>>             public int getDepth(Path path, PathResolver resolver)
>>>>                     throws NamespaceException {
>>>>                 return 1;
>>>>             }
>>>>         };
>>>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
>>>>     }
>>>>
>>>> }
>>>>
>>>> Thanks for your time.
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>
>>>>


Re: Node Retrieval Performance

Posted by Robert Munteanu <ro...@apache.org>.
On Sat, Nov 14, 2015 at 9:23 PM, David Marginian <da...@butterdev.com> wrote:
> Thanks Robert.  I considered moving to Oak, but the system was originally
> designed using JackRabbit and I recently discovered this limitation doing
> load/performance testing for a future requirement.  Moving to Oak now would
> be too large of a change for us to take on now.  Back to JackRabbit, is
> there documentation somewhere on the different node types and which are
> ordered or not?

Note that Jackrabbit has the child node limitation irrespective of
whether the nodes are orderable or not.

As for the node types, the JCR spec would be a good start, see section
3.7.11 Standard Application Node Types

  http://www.day.com/specs/jcr/2.0/3_Repository_Model.html#3.7.11%20Standard%20Application%20Node%20Types

Robert

> I don't need ordered nodes but I can't find documentation
> talking about the nodes and which are ordered (currently using nt:folder).
>
>

Re: Node Retrieval Performance

Posted by David Marginian <da...@butterdev.com>.
Thanks Robert.  I considered moving to Oak, but the system was 
originally designed using JackRabbit and I only recently discovered this 
limitation while doing load/performance testing for a future requirement.  
Moving to Oak would be too large a change for us to take on right now.  
Back to JackRabbit: is there documentation somewhere on the different 
node types and which of them are ordered?  I don't need ordered nodes, but 
I can't find documentation that says which node types are ordered 
(I am currently using nt:folder).



Re: Node Retrieval Performance

Posted by David Marginian <da...@butterdev.com>.
I tried using BTreeManager, but I don't think it will work well for my
case.  Instead I am going to introduce some artificial structure into my
data: build a structure N levels deep, fill each level with up to N
nodes, and then move up the tree as the nodes fill up.
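
A minimal sketch of such a layout, assuming a running item counter and a fixed fan-out per folder. `BucketPath`, `pathFor`, the `item-` prefix and the numeric folder names are made up for illustration and are not part of the Jackrabbit API; the returned string is only a path that would still have to be created or looked up with the usual JCR calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: spreads items over a fixed-depth tree of folders. */
final class BucketPath {
    private BucketPath() {}

    /**
     * Maps a running item number to a path so that no folder ever holds
     * more than fanOut children.
     */
    public static String pathFor(long sequence, int fanOut, int depth) {
        if (sequence < 0 || fanOut < 2 || depth < 1) {
            throw new IllegalArgumentException("need sequence >= 0, fanOut >= 2, depth >= 1");
        }
        // Index of the leaf folder this item lands in.
        long bucket = sequence / fanOut;
        // The folder names are the base-fanOut digits of that index,
        // most significant digit first.
        Deque<String> segments = new ArrayDeque<>();
        for (int level = 0; level < depth; level++) {
            segments.addFirst(Long.toString(bucket % fanOut));
            bucket /= fanOut;
        }
        if (bucket != 0) {
            throw new IllegalArgumentException("sequence exceeds the capacity of this layout");
        }
        return "/" + String.join("/", segments) + "/item-" + sequence;
    }
}
```

With a fan-out of 1000 and two levels, items 0 to 999 land in /0/0, item 1000 starts /0/1, and the layout holds up to 1000^3 items before the top level would overflow.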



Re: Node Retrieval Performance

Posted by Robert Munteanu <ro...@apache.org>.
Hi Clay,

On Sat, Nov 14, 2015 at 5:46 PM, Clay Ferguson <wc...@gmail.com> wrote:
> Robert, I don't think any of us, including myself, had a misunderstanding
> about the fact that the limitation is for a large number of child nodes
> under the SAME parent. No one said 50K in the entire repository was causing
> problems, but 50K children under the same parent IS a problem if it's slow.
> It's a very significant issue for actual application developers trying to
> build something, because everything looks like it's performing great but
> will fail miserably when you scale it up. It's hard to call JCR 'enterprise
> scale' with such a silly limitation staring us all right in the face,
> defying any solution.

That may or may not be true - the original post said that 'after 50K
plus child nodes retrieval is taking ~15 seconds'. I obviously added a
note that this - with the current JCR implementations - is expected.

What you consider a limitation is something that I personally consider
an implementation constraint - if you want to use JCR it's something
that you need to take into account.

That being said, Oak is expected to perform much better with flat
hierarchies, as long as the child nodes are not orderable. So you might
want to try this as well. Just be careful, since nt:unstructured does
have orderable child nodes, so you're better off using something like
oak:Unstructured.

Thanks,

Robert

Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
Robert, I don't think any of us, including myself, had a misunderstanding
about the fact that the limitation is for a large number of child nodes
under the SAME parent. No one said 50K in the entire repository was causing
problems, but 50K children under the same parent IS a problem if it's slow.
It's a very significant issue for actual application developers trying to
build something, because everything looks like it's performing great but
will fail miserably when you scale it up. It's hard to call JCR 'enterprise
scale' with such a silly limitation staring us all right in the face,
defying any solution.

Best regards,
Clay Ferguson
wclayf@gmail.com



Re: Node Retrieval Performance

Posted by Robert Munteanu <ro...@apache.org>.
On Nov 14, 2015 2:21 AM, "Clay Ferguson" <wc...@gmail.com> wrote:
>
> In my opinion this one issue is the single most crippling Achilles' heel of
> the entire JCR. It is very likely to drive away many potential users of this
> API. It's touted as an enterprise-scale API, yet it chokes on just a few tens
> of thousands of nodes. This, IMO, urgently needs to be addressed. I know
> it's a technical limitation, and not a design decision, but to me that just
> means it's an 'unsolved' problem. I'm not complaining or criticizing
> developers, I'm just saying that as a community we need to solve this. In an
> ideal situation I should be able to have 50 million nodes without it being a
> problem. RDBMSs solved these issues years ago with a "never load everything
> all at once" rule. However, somehow the "It's OK to load all children in
> memory" mentality caught on in the JCR and we are now stuck with the results.

Note that this usually applies to direct child nodes, i.e. 50k nodes with
the same parent.

Such a number spread throughout the repository is not an issue.

Robert


Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
In my opinion this one issue is the single most crippling Achilles' heel of
the entire JCR. It is very likely to drive away many potential users of this
API. It's touted as an enterprise-scale API, yet it chokes on just a few tens
of thousands of nodes. This, IMO, urgently needs to be addressed. I know
it's a technical limitation, and not a design decision, but to me that just
means it's an 'unsolved' problem. I'm not complaining or criticizing
developers, I'm just saying that as a community we need to solve this. In an
ideal situation I should be able to have 50 million nodes without it being a
problem. RDBMSs solved these issues years ago with a "never load everything
all at once" rule. However, somehow the "It's OK to load all children in
memory" mentality caught on in the JCR and we are now stuck with the results.


Best regards,
Clay Ferguson
wclayf@gmail.com



Re: Node Retrieval Performance

Posted by Dirk Rudolph <di...@netcentric.biz>.
Did I understand you correctly: you have thousands of child nodes directly
below the root node?

You should avoid this. It is considered bad practice in terms of write
performance and, depending on your concurrent access, it might also block
read access.

http://wiki.apache.org/jackrabbit/Performance

Try to introduce a structure to your content using BTreeManager:


https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html

Cheers, D



-- 

Dirk Rudolph | Senior Software Engineer

Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

dirk.rudolph@netcentric.biz | www.netcentric.biz

Re: Node Retrieval Performance

Posted by David Marginian <da...@butterdev.com>.
Thanks Clay.  I am not trying to load that many records at once.  The
application is crawling a directory.  It places the files from that
directory into Jackrabbit one at a time and puts a content id onto a
queue, which is picked up by consumers on different servers.  Those
consumers then use the content id to retrieve the file from Jackrabbit.
Each piece of content is saved in a node directly under the root node.  The
slowdown comes from calling session.getRootNode(); from what I can gather
from the docs, I need the root node in order to add a child node.  Note
that the slowdown is pretty significant, and I don't need anywhere close to
50K nodes to start seeing it (I see it within a few minutes of running my
app).  I don't need orderable nodes; how do I disable that?
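
For reference, one way to keep every piece of content out of the root node while still letting consumers find it from the content id alone is to derive a nested path from a hash of the id. The following is only a sketch under assumptions: `ContentPaths`, `pathFor` and the `content` top-level folder are hypothetical names, SHA-256 is an arbitrary choice of hash, and the content id is assumed to be usable as a JCR node name as-is.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a bucketed relative path from a content id. */
final class ContentPaths {
    private ContentPaths() {}

    /**
     * Returns "content/xx/yy/<contentId>", where xx and yy are the first two
     * bytes of the SHA-256 hash of the id in hex. That gives 256 * 256
     * intermediate folders, so each one stays small even with many items.
     */
    public static String pathFor(String contentId) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(contentId.getBytes(StandardCharsets.UTF_8));
            return String.format("content/%02x/%02x/%s",
                    hash[0] & 0xff, hash[1] & 0xff, contentId);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the path depends only on the id, the producer and the consumers compute the same location independently, and the result is a relative path of the kind `Node.getNode(String)` accepts on the root node. The intermediate folders would still need to be created on first use.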


On 11/13/2015 03:10 PM, Clay Ferguson wrote:
> Please let us know more about your use case. Why are you even "trying" to
> load that many records all at once? Or at least scan them one by one, I
> mean. In most use cases you wouldn't need to do this kind of thing, unless
> it's some kind of backup or replication. I say "most" cases... I'm not
> saying you don't need to, just asking for a bit more background. BTW: If
> you don't need 'orderable' nodes, try to avoid them. That type of node does
> not work at 'scale'... and 50K is probably pushing it.


Re: Node Retrieval Performance

Posted by Clay Ferguson <wc...@gmail.com>.
Please let us know more about your use case. Why are you trying to load
that many records all at once, or even scan them one by one? In most use
cases you wouldn't need to do this kind of thing, unless it's some kind of
backup or replication. I say "most" cases: I'm not saying you don't need
to, just asking for a bit more background. BTW: if you don't need
'orderable' nodes, try to avoid them. That type of node does not work at
'scale', and 50K is probably pushing it.

Best regards,
Clay Ferguson
wclayf@gmail.com
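
The scaling concern above is usually about how many children sit directly
under one parent. One common way to keep that number small is to spread
documents over a shallow tree of intermediate folders derived from the
document ID. The sketch below is only an illustration and is not taken
from this thread: the class name BucketPath and the method forId are
hypothetical, and it assumes UUID-like hexadecimal IDs that are also valid
JCR names. With such IDs each level has at most 256 children.

```java
/**
 * Hypothetical helper (not part of Jackrabbit): derives a two-level
 * bucket path from a document ID so that no single parent node ends up
 * with tens of thousands of direct children.
 */
final class BucketPath {

    private BucketPath() {
    }

    /**
     * Returns a relative path of the form "xx/yy/<id>", where xx and yy
     * are the first and second pair of characters of the ID.
     * Assumes the ID is at least 4 characters long and evenly
     * distributed (for example a random UUID in hex form).
     */
    static String forId(String documentId) {
        if (documentId == null || documentId.length() < 4) {
            throw new IllegalArgumentException(
                    "documentId must have at least 4 characters");
        }
        String level1 = documentId.substring(0, 2);
        String level2 = documentId.substring(2, 4);
        return level1 + "/" + level2 + "/" + documentId;
    }

    public static void main(String[] args) {
        // Prints the bucket path for a sample ID.
        System.out.println(forId("a1b2c3d4e5"));
    }
}
```

The intermediate folders would then be created on demand before the
document node is added. Using a node type without orderable children for
them (nt:folder, for instance) fits the advice above.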



Re: Node Retrieval Performance

Posted by da...@butterdev.com.
A few more snippets of code that may be helpful:

  public String storeContent(String fileName, InputStream inputStream,
          String contentType, String contentEncoding) throws JcrServiceException {
         String documentId;
         Session session = null;

         try {
             session = getJcrSession();
             ... store the content.
  }

  private Session getJcrSession() throws ClassNotFoundException, RepositoryException {
         Repository repository = repositoryProvider.getRepository();

         // Log in to the default workspace.
         Session session = repository.login(
                 new SimpleCredentials(jcrUsername, jcrPassword.toCharArray()));
         if (logger.isDebugEnabled())
             logger.debug("Content Repository User: " + session.getUserID());
         Workspace workspace = session.getWorkspace();
         if (logger.isDebugEnabled())
             logger.debug("Content Repository Workspace: " + workspace.getName());

         try {
             workspace.getNamespaceRegistry().registerNamespace(NAMESPACE_PREFIX, NAMESPACE_URI);
         } catch (NamespaceException e) {
             // This is OK.
             logger.debug("The namespace has already been registered.");
         } catch (RepositoryException e) {
             // Release the session before failing so it is not leaked.
             session.logout();
             throw new RepositoryException("Could not create JCR session.", e);
         }
         return session;
     }
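
In the snippets above, getJcrSession() asks repositoryProvider for a
repository on every call, and the getRepository() shown earlier builds a
new RepositoryConfigImpl each time. Whether that contributes to the
slowdown is not established in this thread, but it may be worth ruling
out by creating the repository once and sharing it across threads, while
still opening one session per unit of work. A minimal sketch of such a
holder, using only the JDK, follows. The class name Memoized is
hypothetical, and a real provider would need to wrap the checked
RepositoryException inside the supplier.

```java
import java.util.function.Supplier;

/**
 * Hypothetical thread-safe holder: calls the delegate once and then
 * returns the same instance to every caller.
 * Assumes the delegate never returns null (a null result would cause
 * the delegate to be invoked again on the next call).
 */
final class Memoized<T> implements Supplier<T> {

    private final Supplier<T> delegate;
    private volatile T value;

    Memoized(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            // Double-checked locking: only the first caller creates the value.
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = delegate.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

A provider could hold a single Memoized instance as a field and have
getRepository() return its get() result, so that every session is opened
from the same repository object.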
