Posted to oak-dev@jackrabbit.apache.org by William Markmann <bi...@counterpointconsulting.com> on 2018/03/16 15:35:06 UTC

Decreased performance with many inserts

Has anyone experienced a significant slowdown when adding many (tens or
hundreds of thousands of) documents to an Oak repository?

I'm using:


    <tika.version>1.7</tika.version>
    <jackrabbit.version>2.14.1</jackrabbit.version>
    <oak.version>1.8-SNAPSHOT</oak.version>
    <lucene.version>4.7.1</lucene.version>

and creating the repository (Spring Boot app) basically like:

    MBeanExporter mbe = new MBeanExporter();
    mbe.setServer(mbs);
    mbe.setNamingStrategy(new IdentityNamingStrategy());

    GCMonitor gcMonitor = new GCMonitorTracker();
    StatisticsProvider statisticsProvider =
            new MetricStatisticsProvider(mbs, Oak.defaultScheduledExecutor());

    FileStoreBuilder fsBuilder =
            FileStoreBuilder.fileStoreBuilder(new File(repoDirectory));
    fsBuilder.withGCMonitor(gcMonitor);
    fsBuilder.withIOMonitor(new MetricsIOMonitor(statisticsProvider));
    fsBuilder.withStatisticsProvider(statisticsProvider);

    this.fs = fsBuilder.build();

    SegmentNodeStoreBuilder nsBuilder = SegmentNodeStoreBuilders.builder(fs);
    nsBuilder.withStatisticsProvider(statisticsProvider);
    this.ns = nsBuilder.build();
    this.executor = Oak.defaultExecutorService();
    this.oak = new Oak(ns);
    this.oak.with(mbs);
    this.oak.withAsyncIndexing("async", 5);
    this.jcr = new Jcr(oak);
    this.repository = jcr.createRepository();


The basic problem is that I'm doing a data migration (~1 million docs) from
a legacy system.  When I start inserting the documents into Oak (the folder
structure is very flat), it absolutely flies in the beginning, but it slows
down significantly by the time I get to 75k or so documents.  Watching the
"OneMinuteRate" attribute of
"org.apache.jackrabbit.oak:name=oak.segment.segment-write-time,type=Metrics"
shows an 80% slowdown over the course of an hour or so.
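
For anyone who wants to track that rate outside a JMX console, here is a
minimal sketch that uses only the JDK's javax.management API.  It assumes the
Oak MBeans are registered on the platform MBean server (the "mbs" in the setup
code may be a different server), and the class and method names are
illustrative, not part of Oak.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalDouble;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class SegmentWriteRateProbe {
    // MBean name as quoted in the message; whether it is registered depends on the setup.
    static final String SEGMENT_WRITE_TIME =
            "org.apache.jackrabbit.oak:name=oak.segment.segment-write-time,type=Metrics";

    /** Returns the numeric attribute value, or empty if the MBean is absent or the value is not a number. */
    static OptionalDouble readRate(MBeanServer server, String mbeanName, String attribute) {
        try {
            ObjectName name = new ObjectName(mbeanName);
            if (!server.isRegistered(name)) {
                return OptionalDouble.empty();
            }
            Object value = server.getAttribute(name, attribute);
            return value instanceof Number
                    ? OptionalDouble.of(((Number) value).doubleValue())
                    : OptionalDouble.empty();
        } catch (JMException e) {
            throw new IllegalStateException("Could not read " + attribute + " from " + mbeanName, e);
        }
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        OptionalDouble rate = readRate(server, SEGMENT_WRITE_TIME, "OneMinuteRate");
        System.out.println(rate.isPresent()
                ? "OneMinuteRate = " + rate.getAsDouble()
                : "MBean not registered in this JVM");
    }
}
```

Sampling this once a minute during the import and logging it next to the
document count would show where the rate starts to drop.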

A possibly related observation: when I first start out, I see the "async"
indexer running and logging as the folders and documents are being created,
but it stops logging anything within the first 20k inserts.

I should also add that the sessions doing the writes add roughly 10 file
nodes before syncing (i.e. not using a session per file, and not doing it
all in one session).  The actual inserts are done by a thread pool with
about six workers simultaneously writing the files into Oak.
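
The write pattern described above (one save per batch of roughly 10 nodes
rather than one per file) can be sketched as follows.  This is JDK-only and
illustrative: BatchedImport, addNode and save are hypothetical stand-ins for
the real parentNode.addNode(...) and session.save() calls, and each worker
thread is assumed to run it with its own session.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class BatchedImport {
    /** Applies addNode to every item and calls save once per full batch, plus once for any remainder. */
    static <T> int importInBatches(List<T> items, int batchSize, Consumer<T> addNode, Runnable save) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int saves = 0;
        int pending = 0;
        for (T item : items) {
            addNode.accept(item);        // stands in for creating the nt:file node
            pending++;
            if (pending == batchSize) {
                save.run();              // stands in for session.save()
                saves++;
                pending = 0;
            }
        }
        if (pending > 0) {               // flush the final partial batch
            save.run();
            saves++;
        }
        return saves;
    }

    public static void main(String[] args) {
        AtomicInteger saveCalls = new AtomicInteger();
        List<String> files = Collections.nCopies(25, "doc.pdf");
        int saves = importInBatches(files, 10, f -> { }, saveCalls::incrementAndGet);
        System.out.println(saves + " saves for " + files.size() + " files");
    }
}
```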

Has anyone else seen similar behavior?  Is there anything I should be
taking into account when moving so many files at once?

Any thoughts would be hugely appreciated.  Thanks!


-- 
*Bill Markmann*
*President | 866 809 0394 x 701*
*Counterpoint Consulting*
*Automate. Innovate. Accelerate.*
c20g.com | *Blog <http://www.c20g.com/site/blog> **| Linkedin
<http://www.linkedin.com/company/counterpoint-consulting-inc.>** | Twitter
<https://twitter.com/c20g>*

Re: Decreased performance with many inserts

Posted by Marcel Reutegger <mr...@adobe.com.INVALID>.
Hi,

On 21.03.18 19:18, William Markmann wrote:
> Would this be an appropriate template for how to create the FileStore with
> a FileDataStore BlobStore?  (Am I getting the relationship right there?):
> 
> https://gist.github.com/chetanmeh/6242d0a7fe421955d456

Yes, something along those lines should do it.

> On Wed, Mar 21, 2018 at 11:48 AM, William Markmann wrote:
>> Should I be using the FileDataStore even though I'm not sharing them
>> between repositories?  Would this be a general guideline for using the
>> SegmentNodeStore with a large volume of data?

As soon as you operate with a large volume of data, this would be my 
recommendation, yes.
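
For readers of the archive, a configuration fragment along the lines of the
gist might look like the sketch below.  It is not taken from the gist or from
the Oak documentation: the "-datastore" directory name is an assumption,
repoDirectory is the variable from the setup code earlier in the thread, and
exception handling is omitted.  It needs the Oak segment and blob-plugin
modules plus jackrabbit-data on the classpath.

```java
import java.io.File;
import org.apache.jackrabbit.core.data.FileDataStore;
import org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore;
import org.apache.jackrabbit.oak.segment.SegmentNodeStore;
import org.apache.jackrabbit.oak.segment.SegmentNodeStoreBuilders;
import org.apache.jackrabbit.oak.segment.file.FileStore;
import org.apache.jackrabbit.oak.segment.file.FileStoreBuilder;

// Binaries go to their own directory instead of the segment tar files.
FileDataStore fds = new FileDataStore();
fds.setPath(repoDirectory + "-datastore");   // assumed location, next to the repository
fds.init(repoDirectory);                     // home dir; only used when no path is set

// The FileStore is built as before, but with the data store plugged in as its BlobStore.
FileStore fs = FileStoreBuilder.fileStoreBuilder(new File(repoDirectory))
        .withBlobStore(new DataStoreBlobStore(fds))
        .build();
SegmentNodeStore ns = SegmentNodeStoreBuilders.builder(fs).build();
```

With this in place the startup log should no longer report blobStore=null for
the FileStoreBuilder.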

I created a JIRA issue to update the documentation: 
https://issues.apache.org/jira/browse/OAK-7372

Regards
  Marcel

Re: Decreased performance with many inserts

Posted by William Markmann <bi...@counterpointconsulting.com>.
Would this be an appropriate template for how to create the FileStore with
a FileDataStore BlobStore?  (Am I getting the relationship right there?):

https://gist.github.com/chetanmeh/6242d0a7fe421955d456

Thanks, - Bill


Re: Decreased performance with many inserts

Posted by William Markmann <bi...@counterpointconsulting.com>.
I understand the reasoning behind the first point; I just wanted to check
whether I'd missed something.

Your second point is more interesting to me, though...  what you're
suggesting sounds like I'm putting the JCR "tree" and the file content in
the same tar files when I don't have to?  I assumed they were part and
parcel... :-)

The completely stripped-down configuration I use to get the repository
instance that the rest of my application uses looks like:

FileStoreBuilder fsBuilder =
        FileStoreBuilder.fileStoreBuilder(new File(repoDirectory));
this.fs = fsBuilder.build();

SegmentNodeStoreBuilder nsBuilder = SegmentNodeStoreBuilders.builder(fs);
this.ns = nsBuilder.build();
this.executor = Oak.defaultExecutorService();
this.oak = new Oak(ns);
this.jcr = new Jcr(oak);
this.repository = jcr.createRepository();

...and that's it.  How would I tell it to split out the binary data?  The
NodeStore documentation says "By default SegmentNodeStore (aka TarMK) does
not require a BlobStore. Instead the binary content is directly stored as
part of segment blob itself...  FileDataStore - This should be used if the
blobs/binaries have to be shared between multiple repositories."

Should I be using the FileDataStore even though I'm not sharing them
between repositories?  Would this be a general guideline for using the
SegmentNodeStore with a large volume of data?

Thanks! - Bill



Re: Decreased performance with many inserts

Posted by Marcel Reutegger <mr...@adobe.com.INVALID>.
Hi,

On 17.03.18 18:32, William Markmann wrote:
> 1) Is it necessary to do the above in two steps, or can a Node be created
> and checked in with the VersionManager in one shot?

Yes, this is necessary. The JCR specification defines a checkin as a 
workspace operation, which can only operate on saved changes. That is, 
you must first save the node and only then you can check it in.

> 4) Is there anything inherent in the SegmentNodeStore that would decrease
> in performance as the repository grows?

I'm not exactly sure what your deployment looks like, but you may be
storing the binary data in the tar files written by the SegmentNodeStore 
as well. This has an adverse effect on data locality and the general 
recommendation is to configure a separate DataStore for binary data.

Regards
  Marcel

Re: Decreased performance with many inserts

Posted by William Markmann <bi...@counterpointconsulting.com>.
If it helps, this is what the Oak configuration shows when it spins up:

2018-03-17 13:15:48.171  INFO 35898 --- [           main]
o.a.j.oak.segment.file.FileStore         : Creating file store
FileStoreBuilder{version=1.8-SNAPSHOT, directory=/dms/oak-repository,
blobStore=null, maxFileSize=256, segmentCacheSize=256, stringCacheSize=256,
templateCacheSize=64, stringDeduplicationCacheSize=15000,
templateDeduplicationCacheSize=3000, nodeDeduplicationCacheSize=1048576,
memoryMapping=true, gcOptions=SegmentGCOptions{paused=false,
estimationDisabled=false, gcSizeDeltaEstimation=1073741824, retryCount=5,
forceTimeout=60, retainedGenerations=2, gcType=FULL}}
2018-03-17 13:15:48.774  INFO 35898 --- [           main]
o.a.j.oak.segment.file.FileStore         : TarMK opened:
/dms/oak-repository (mmap=true)
2018-03-17 13:15:48.806  INFO 35898 --- [           main]
SegmentNodeStore$SegmentNodeStoreBuilder : Creating segment node store
SegmentNodeStoreBuilder{blobStore=inline}
2018-03-17 13:15:48.812  INFO 35898 --- [           main]
o.a.j.o.s.scheduler.LockBasedScheduler   : Initializing SegmentNodeStore
with the commitFairLock option enabled.

At this point, there are 215,275 documents in the repository.  In the TarMK
/ repository folder, the size on disk is about 118G.  Creating and saving a
file node takes about 3-5 seconds, and calling the VersionManager to check
it in takes about the same.  When there was no content in the repository,
both of those operations were almost instantaneous.

I've been looking at the Adobe AEM documentation / forums, and there seems
to be some mixed guidance on using offline compaction (I'm not using AEM
but assuming the general principles would apply).  I have tried the same
thing with GC disabled:

SegmentGCOptions gcOptions = SegmentGCOptions.defaultGCOptions().setOffline();
...
fsBuilder.withGCOptions(gcOptions);

...with the same basic performance as I see with it enabled.  (I'm getting
one file imported every ~3-10 seconds.)

I just came across a presentation:
https://adapt.to/content/dam/adaptto/production/presentations/2016/adaptTo2016-Into-the-tar-pit-a-TarMK-deep-dive-Michael-Duerig-notes.pdf/_jcr_content/renditions/original./adaptTo2016-Into-the-tar-pit-a-TarMK-deep-dive-Michael-Duerig-notes.pdf

It mentions TarMK has "limited scalability"...  my initial tests of a
handful of gigabytes seemed OK, but am I hitting an inherent limitation in
the size of data TarMK can handle?



Re: Decreased performance with many inserts

Posted by William Markmann <bi...@counterpointconsulting.com>.
After adding some JMX instrumentation, I can clearly see that the time is
being spent in the session.save() called when initially creating the
document node, and then calling vm.checkin(newNode.getPath()) after that.
These slowed way down as more and more nodes were added to the repository.
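
One JDK-only way to collect per-operation timings like these is sketched
below.  It is not the instrumentation actually used here: OperationTimer and
its methods are hypothetical, and exposing the numbers over JMX is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class OperationTimer {
    /** A unit of work that may throw a checked exception, as JCR calls do. */
    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    private static final class Stats {
        final LongAdder calls = new LongAdder();
        final LongAdder nanos = new LongAdder();
    }

    private final Map<String, Stats> stats = new ConcurrentHashMap<>();

    /** Runs the step and records its duration under the given name, even if it fails. */
    void time(String operation, Step step) {
        long start = System.nanoTime();
        try {
            step.run();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(operation + " failed", e);
        } finally {
            Stats s = stats.computeIfAbsent(operation, k -> new Stats());
            s.calls.increment();
            s.nanos.add(System.nanoTime() - start);
        }
    }

    long calls(String operation) {
        Stats s = stats.get(operation);
        return s == null ? 0 : s.calls.sum();
    }

    double meanMillis(String operation) {
        Stats s = stats.get(operation);
        long n = s == null ? 0 : s.calls.sum();
        return n == 0 ? 0.0 : s.nanos.sum() / 1_000_000.0 / n;
    }

    public static void main(String[] args) {
        OperationTimer timer = new OperationTimer();
        timer.time("save", () -> Thread.sleep(5));      // stands in for session.save()
        timer.time("checkin", () -> Thread.sleep(2));   // stands in for vm.checkin(path)
        System.out.printf("save: %d call(s), mean %.1f ms%n",
                timer.calls("save"), timer.meanMillis("save"));
        System.out.printf("checkin: %d call(s), mean %.1f ms%n",
                timer.calls("checkin"), timer.meanMillis("checkin"));
    }
}
```

Logging the two means every few thousand documents would show whether save
and checkin degrade at the same pace.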

1) Is it necessary to do the above in two steps, or can a Node be created
and checked in with the VersionManager in one shot?
2) What is actually happening in terms of indexing when I do the above?  Is
there any way to / would it be useful to temporarily disable any on-the-fly
indexing during a bulk import and run it later?
3) Does GC / compaction in the NodeStore come into play when adding nodes?
Same question -- is there a way to / would it be useful to disable anything
related to those during a bulk import and perform them offline after?
4) Is there anything inherent in the SegmentNodeStore that would decrease
in performance as the repository grows?

Thanks! - Bill



Re: Decreased performance with many inserts

Posted by William Markmann <bi...@counterpointconsulting.com>.
Is there any reason I'd see:

2018-03-15 21:48:54.673  INFO 20475 --- [ex-update-async]
o.a.j.oak.plugins.index.IndexUpdate      : Incremental indexing Traversed
#10000 /NJ Foreclosure/342CKA-IANWK/SCRA SEARCHES-AAMZG/FRCL201604PTI00003XEGT-20160425-4351400/jcr:content
[Infinity nodes/s, Infinity nodes/hr]

...regularly at the outset, but it stops appearing after a certain point?

If I take a thread-dump after it starts slowing down, I see the worker
threads (usually all but one once slow-down starts) parked at:

"pool-5-thread-3" #25772 prio=5 os_prio=0 tid=0x000000000201e000 nid=0xc0c0 waiting on condition [0x00007f42486cc000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000004c1327098> (a java.util.concurrent.Semaphore$FairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
    at org.apache.jackrabbit.oak.segment.scheduler.LockBasedScheduler.schedule(LockBasedScheduler.java:217)
    at org.apache.jackrabbit.oak.segment.SegmentNodeStore.merge(SegmentNodeStore.java:195)
    at org.apache.jackrabbit.oak.core.MutableRoot.commit(MutableRoot.java:248)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.commit(SessionDelegate.java:347)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.commit(SessionDelegate.java:360)
    at org.apache.jackrabbit.oak.jcr.version.ReadWriteVersionManager.checkin(ReadWriteVersionManager.java:129)
    at org.apache.jackrabbit.oak.jcr.delegate.VersionManagerDelegate.checkin(VersionManagerDelegate.java:67)
    at org.apache.jackrabbit.oak.jcr.version.VersionManagerImpl$7.perform(VersionManagerImpl.java:371)
    at org.apache.jackrabbit.oak.jcr.version.VersionManagerImpl$7.perform(VersionManagerImpl.java:362)
    at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:208)
    at org.apache.jackrabbit.oak.jcr.version.VersionManagerImpl.checkin(VersionManagerImpl.java:362)


When I'm inserting documents at the very beginning (no content yet), the
individual threads don't stay parked in that state for nearly as long...
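
The trace shows the workers waiting in Semaphore.acquire(), called from
LockBasedScheduler.schedule(), on a Semaphore$FairSync.  The JDK-only sketch
below is a toy model of that situation.  It assumes, from the trace rather
than from the Oak source, that a single fair permit guards each commit; the
class name and the sleep that stands in for the commit are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class CommitQueueDemo {
    /** Runs the simulated commits and returns {total commits, max commits in progress at once}. */
    static int[] run(int workers, int commitsPerWorker, long commitMillis) {
        Semaphore commitLock = new Semaphore(1, true);   // one permit, fair (FIFO) hand-off
        AtomicInteger inCommit = new AtomicInteger();
        AtomicInteger maxInCommit = new AtomicInteger();
        AtomicInteger total = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                for (int i = 0; i < commitsPerWorker; i++) {
                    commitLock.acquireUninterruptibly();   // workers park here while another commit runs
                    try {
                        int now = inCommit.incrementAndGet();
                        maxInCommit.accumulateAndGet(now, Math::max);
                        pause(commitMillis);               // stands in for the commit itself
                        total.incrementAndGet();
                    } finally {
                        inCommit.decrementAndGet();
                        commitLock.release();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { total.get(), maxInCommit.get() };
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int[] result = run(6, 5, 20);
        System.out.println(result[0] + " commits, at most " + result[1] + " in progress at once");
    }
}
```

In this model six workers do not make commits finish any faster: if each
commit takes longer as the repository grows, the time every other worker
spends parked grows with it, which would match all but one thread waiting.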




Re: Decreased performance with many inserts

Posted by Julian Reschke <ju...@gmx.de>.
On 2018-03-16 16:46, William Markmann wrote:
> Folders are: org.apache.jackrabbit.JcrConstants.NT_FOLDER
> Documents are:
>
>     Binary fileBinary =
>             session.getValueFactory().createBinary(new ByteArrayInputStream(data));
>     Node newFile = parentNode.addNode(filename, JcrConstants.NT_FILE);
>     newFile.addMixin(JcrConstants.MIX_VERSIONABLE);
>     Node docContents = newFile.addNode(JcrConstants.JCR_CONTENT, JcrConstants.NT_RESOURCE);
>     // docContents.setProperty(JcrConstants.JCR_MIMETYPE,
>     //         getMimeType(filename, getFileExtension(filename)));
>     docContents.setProperty(JcrConstants.JCR_MIMETYPE,
>             FileUtils.getMimeType(FileUtils.getFileExtension(filename)));
>     docContents.setProperty(JcrConstants.JCR_ENCODING, "");
>     docContents.setProperty(JcrConstants.JCR_DATA, fileBinary);
>
> Is there a better choice?
> ...

I was worried the folder might have "orderable" child nodes, which 
creates a significant overhead. But AFAIR that is not the case for 
nt:folder (but you may want to check).

Best regards, Julian

PS: I wouldn't set JCR_ENCODING if that information isn't present.

Re: Decreased performance with many inserts

Posted by William Markmann <bi...@counterpointconsulting.com>.
Folders are: org.apache.jackrabbit.JcrConstants.NT_FOLDER
Documents are:

    Binary fileBinary =
            session.getValueFactory().createBinary(new ByteArrayInputStream(data));
    Node newFile = parentNode.addNode(filename, JcrConstants.NT_FILE);
    newFile.addMixin(JcrConstants.MIX_VERSIONABLE);
    Node docContents = newFile.addNode(JcrConstants.JCR_CONTENT, JcrConstants.NT_RESOURCE);
    // docContents.setProperty(JcrConstants.JCR_MIMETYPE,
    //         getMimeType(filename, getFileExtension(filename)));
    docContents.setProperty(JcrConstants.JCR_MIMETYPE,
            FileUtils.getMimeType(FileUtils.getFileExtension(filename)));
    docContents.setProperty(JcrConstants.JCR_ENCODING, "");
    docContents.setProperty(JcrConstants.JCR_DATA, fileBinary);

Is there a better choice?


Re: Decreased performance with many inserts

Posted by Julian Reschke <ju...@gmx.de>.
On 2018-03-16 16:35, William Markmann wrote:
> Has anyone experienced a significant slowdown when adding many (tens /
> hundreds of thousands) of documents to an Oak repository?
> ...

What's the JCR node type of the folder you are inserting into?

Best regards, Julian