Posted to users@jackrabbit.apache.org by aasoj j <aa...@gmail.com> on 2010/02/02 09:19:09 UTC

Patch to limit number of updates while synchronization

Hi,

We have been using Jackrabbit in our web application for a while now. In a
clustered deployment, we run into synchronization issues when the number of
revisions to be synced is very high. To fix this, we decided to limit the
number of revisions returned by the getRecords method of DatabaseJournal.
We also modified the doSync method of AbstractJournal so that it now loops
for as long as getRecords returns records. The following patch solves our
problem. Please let me know if you foresee any issues with it.
Thanks in advance.


Index: src/main/java/org/apache/jackrabbit/core/journal/DatabaseJournal.java
===================================================================
--- src/main/java/org/apache/jackrabbit/core/journal/DatabaseJournal.java (revision 905538)
+++ src/main/java/org/apache/jackrabbit/core/journal/DatabaseJournal.java (working copy)
@@ -911,7 +911,7 @@
     protected void buildSQLStatements() {
         selectRevisionsStmtSQL =
             "select REVISION_ID, JOURNAL_ID, PRODUCER_ID, REVISION_DATA from "
-            + schemaObjectPrefix + "JOURNAL where REVISION_ID > ? order by REVISION_ID";
+            + schemaObjectPrefix + "JOURNAL where REVISION_ID > ? order by REVISION_ID limit 100";
         updateGlobalStmtSQL =
             "update " + schemaObjectPrefix + "GLOBAL_REVISION"
             + " set REVISION_ID = REVISION_ID + 1";
Index: src/main/java/org/apache/jackrabbit/core/journal/AbstractJournal.java
===================================================================
--- src/main/java/org/apache/jackrabbit/core/journal/AbstractJournal.java (revision 905538)
+++ src/main/java/org/apache/jackrabbit/core/journal/AbstractJournal.java (working copy)
@@ -198,40 +198,44 @@
      * @throws JournalException if an error occurs
      */
     protected void doSync(long startRevision) throws JournalException {
-        RecordIterator iterator = getRecords(startRevision);
         long stopRevision = Long.MIN_VALUE;
-
-        try {
-            while (iterator.hasNext()) {
-                Record record = iterator.nextRecord();
-                if (record.getJournalId().equals(id)) {
-                    log.info("Record with revision '" + record.getRevision()
-                            + "' created by this journal, skipped.");
-                } else {
-                    RecordConsumer consumer = getConsumer(record.getProducerId());
-                    if (consumer != null) {
-                        try {
-                            consumer.consume(record);
-                        } catch (IllegalStateException e) {
-                            log.error("Could not synchronize to revision: " + record.getRevision() + " due illegal state of RecordConsumer.");
-                            return;
+        do {
+            RecordIterator iterator = getRecords(startRevision);
+            stopRevision = Long.MIN_VALUE;
+
+            try {
+                while (iterator.hasNext()) {
+                    Record record = iterator.nextRecord();
+                    if (record.getJournalId().equals(id)) {
+                        log.info("Record with revision '" + record.getRevision()
+                                + "' created by this journal, skipped.");
+                    } else {
+                        RecordConsumer consumer = getConsumer(record.getProducerId());
+                        if (consumer != null) {
+                            try {
+                                consumer.consume(record);
+                            } catch (IllegalStateException e) {
+                                log.error("Could not synchronize to revision: " + record.getRevision() + " due illegal state of RecordConsumer.");
+                                return;
+                            }
                         }
                     }
+                    stopRevision = record.getRevision();
                 }
-                stopRevision = record.getRevision();
+            } finally {
+                iterator.close();
             }
-        } finally {
-            iterator.close();
-        }
-
-        if (stopRevision > 0) {
-            Iterator iter = consumers.values().iterator();
-            while (iter.hasNext()) {
-                RecordConsumer consumer = (RecordConsumer) iter.next();
-                consumer.setRevision(stopRevision);
+
+            if (stopRevision > 0) {
+                Iterator iter = consumers.values().iterator();
+                while (iter.hasNext()) {
+                    RecordConsumer consumer = (RecordConsumer) iter.next();
+                    consumer.setRevision(stopRevision);
+                }
+                log.info("Synchronized to revision: " + stopRevision);
+                startRevision = stopRevision;
             }
-            log.info("Synchronized to revision: " + stopRevision);
-        }
+        } while (stopRevision > 0);
     }

     /**
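
The idea behind the patch can be sketched independently of Jackrabbit. The
class below is a minimal, self-contained illustration and not the Jackrabbit
implementation: the in-memory list of revision ids, the getRecords helper with
its limit parameter, and BATCH_SIZE are all assumptions made up for this
example. It fetches at most 100 revisions per round and stops on the first
empty batch instead of testing stopRevision > 0. Note also that a "limit"
clause is not part of standard SQL, so the patched statement may not be
accepted by every database.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of batch-wise catch-up: fetch at most BATCH_SIZE revisions per round. */
final class BatchedSyncSketch {

    static final int BATCH_SIZE = 100; // mirrors the "limit 100" in the patch

    /** Hypothetical stand-in for a getRecords(startRevision) call with a row limit. */
    static List<Long> getRecords(List<Long> journal, long startRevision, int limit) {
        List<Long> batch = new ArrayList<>();
        for (long revision : journal) {          // journal is ordered by revision id
            if (revision > startRevision) {
                batch.add(revision);
                if (batch.size() == limit) {
                    break;
                }
            }
        }
        return batch;
    }

    /** Loops until a fetch returns no records; returns the last revision consumed. */
    static long sync(List<Long> journal, long startRevision) {
        long current = startRevision;
        while (true) {
            List<Long> batch = getRecords(journal, current, BATCH_SIZE);
            if (batch.isEmpty()) {
                return current;                  // nothing left to sync
            }
            for (long revision : batch) {
                // a real consumer would apply the record here
                current = revision;
            }
        }
    }

    public static void main(String[] args) {
        List<Long> journal = new ArrayList<>();
        for (long r = 1; r <= 250; r++) {
            journal.add(r);
        }
        System.out.println(sync(journal, 0L));   // prints 250
    }
}
```

With 250 pending revisions, this takes three non-empty rounds (100, 100, 50)
followed by one empty fetch that ends the loop.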

Regards
aasoj

Re: Finetuning (JCR) Search

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello,

On Tue, Feb 2, 2010 at 11:29 AM, Robbert Uittenbroek
<r....@rug.nl> wrote:
> Hi Ard,
>
> Thanks for your feedback.
>
> We use the jcr:like because we want to make sure the virtual path starts
> with the specified path/keyword, rather than containing it.

I mentioned it because some time ago I saw mails, I guess from a
colleague of yours, mentioning millions of 'cms:virtualPathLC' values. If
you are using /corporate/%, and you have, say, 100,000 unique
virtualPathLC values starting with /corporate, do you realize what
happens in Lucene internally? You might want to search for Lucene query
expansion; that should give you an idea of why I am worried about
performance for you. You can best either override the existing indexing
to optimize for what you want or, if you do not want to do that, make
sure that if you have, for example, a

 cms:virtualPathLC = /corporate/foo/bar/lux

that you add a multivalued property:

 cms:virtualPathLCs where the values are

/corporate/foo/bar/lux
/corporate/foo/bar
/corporate/foo
/corporate

I guess you want 'scope'-like searches. That works perfectly well the
way you are doing it now, but I think not for the number of documents
you are talking about. With my multivalued property suggestion, you can
do a simple equals, which translates to a single term in Lucene and
returns your hundreds of thousands of hits instantly, instead of
running into an OOM (out-of-memory error).
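
A minimal sketch of how the values for such a multivalued property could be
derived from a single path. The class and method names are made up for this
example, and it assumes absolute paths without a trailing slash:

```java
import java.util.ArrayList;
import java.util.List;

final class VirtualPathScopes {

    /** Returns the path itself followed by each ancestor, longest first, excluding the root "/". */
    static List<String> ancestorsAndSelf(String path) {
        List<String> scopes = new ArrayList<>();
        String current = path;
        while (current.length() > 1) {           // stop at "" or "/"
            scopes.add(current);
            int lastSlash = current.lastIndexOf('/');
            if (lastSlash <= 0) {
                break;                            // reached the top-level segment
            }
            current = current.substring(0, lastSlash);
        }
        return scopes;
    }

    public static void main(String[] args) {
        // prints the four values listed above, one per line
        for (String scope : ancestorsAndSelf("/corporate/foo/bar/lux")) {
            System.out.println(scope);
        }
    }
}
```

The resulting list could then be stored with Node.setProperty(String, String[])
under cms:virtualPathLCs, so that the jcr:like on '/corporate/%' becomes an
equality test such as @cms:virtualPathLCs = '/corporate'.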

>
> As for searching jcr:data explicitly, I did some more Google searching
> and it seems to me it is not quite possible.
>
> On this site
> http://wiki.exoplatform.org/xwiki/bin/view/JCR/Fulltext+Search , there
> is a section stating:
> "For example. We have property jcr:data (it' BINARY). Its stored well.
> But you will never find any string with query like:
> SELECT * FROM nt:resource WHERE CONTAINS(jcr:data, 'some string')
> Because,  BINARY is not searchable by full text search on exact property."
>
> You said:
> jcr:contains(.,'foo') is node scope level (.)
> jcr:contains(jcr:data, 'foo') search in jcr:data property
>
> which seems logical and is also what we have tried, where the first
> works, but using jcr:data as property returns no results.
>
> As for my original question, I guess it is not possible to search in the
> jcr:data property only for certain keywords, which I would find most
> weird as in this case it is the contents (and contents only) of a
> document we want to search in.. which are stored in jcr:data.. hmm.

I checked the code, and indeed a binary value is only indexed at node
scope level. This is most likely in line with the spec. If you extend
the Jackrabbit SearchIndex, you can easily plug in an extended
Jackrabbit NodeIndexer in which you override addBinaryValue. There, next
to the createFulltextField call, you also need to index the value as a
field of its own. I think after this you can query it like
jcr:contains(jcr:data, 'foo'). Obviously, indexing it separately as a
property as well is not really nice with respect to performance and
index size.

Regards Ard

>
> Cheers,
>
> Robbert
>
>
>
>
> Ard Schrijvers schreef:
>> Hello Robbert,
>>
>> On Tue, Feb 2, 2010 at 9:24 AM, Robbert Uittenbroek
>> <r....@rug.nl> wrote:
>>
>>> Hello,
>>>
>>> I have a question regarding searching (in) the jcr:data property.
>>>
>>> We store the contents of our documents in the jcr:content/jcr:data
>>> property. We also have added many custom properties to the jcr:content
>>> node, like creator, modifier, storageStatus and paths.
>>>
>>> In most search-cases, we want to search the jcr:data contents only. It
>>> now seems all properties are indexed by Lucene, and when we search we
>>> find files which have the keywords in other properties than jcr:data.
>>> While we do need to be able to search those properties in certain cases,
>>> we also want to be able to search in 'contents only', hence the jcr:data
>>> property. Can this be done, and if so, how? We use the xpath search
>>> expression, and even though I've seen the SQL use jcr:data (I believe) as
>>> field to search on, I can't seem to do this with the xpath expression.
>>>
>>> Example of the used xpath expression:
>>>
>>
>> First of all, I really doubt whether you want to use jcr:like. It is
>> really not scaling at all, let alone searching in binaries. Why aren't
>> you using jcr:contains?
>>
>> Furthermore, searching in a single property is as simple as defining
>> which property to search in the jcr:contains:
>>
>> thus:
>>
>> jcr:contains(.,'foo') is node scope level (.)
>> jcr:contains(jcr:data, 'foo') search in jcr:data property
>>
>> Regards Ard
>>
>>
>>
>>> /jcr:root/webplatform/www.rug.nl//element(*,
>>> nt:file)/jcr:content[jcr:like(@cms:virtualPathLC, '/corporate/%') and
>>> @cms:type and not(@cms:type='link') and not(@cms:type='folder') and
>>> not(@cms:type='function') and not(@cms:type='metadata') and
>>> jcr:contains(., 'zernike')]/(rep:excerpt()|@cms:type) order by
>>> jcr:score() descending
>>>
>>> Any help on this matter would be appreciated.
>>>
>>> Kind regards,
>>>
>>> Robbert Uittenbroek
>>>
>>>
>>>
>
>
> --
> Robbert M. Uittenbroek
> Webdeveloper
>
> Rijksuniversiteit Groningen
> Donald Smits Centrum voor Informatie Technologie
> Applicatieontwikkeling
>
> Zernikeborg
> Nettelbosje 1
> 9747 AJ Groningen
> Tel. 050 363 9298
> http://www.rug.nl/cit
> --
>
>

Re: Finetuning (JCR) Search

Posted by Robbert Uittenbroek <r....@rug.nl>.
Hi Ard,

Thanks for your feedback.

We use the jcr:like because we want to make sure the virtual path starts
with the specified path/keyword, rather than containing it.

As for searching jcr:data explicitly, I did some more Google searching
and it seems to me it is not quite possible.

On this site
http://wiki.exoplatform.org/xwiki/bin/view/JCR/Fulltext+Search , there
is a section stating:
"For example. We have property jcr:data (it' BINARY). Its stored well.
But you will never find any string with query like:
SELECT * FROM nt:resource WHERE CONTAINS(jcr:data, 'some string')
Because,  BINARY is not searchable by full text search on exact property."

You said:
jcr:contains(.,'foo') is node scope level (.)
jcr:contains(jcr:data, 'foo') search in jcr:data property

which seems logical and is also what we have tried, where the first
works, but using jcr:data as property returns no results.

As for my original question, I guess it is not possible to search only in
the jcr:data property for certain keywords, which I find rather odd, since
in this case it is the contents (and contents only) of a document that we
want to search, and those are stored in jcr:data.

Cheers,

Robbert




Ard Schrijvers schreef:
> Hello Robbert,
>
> On Tue, Feb 2, 2010 at 9:24 AM, Robbert Uittenbroek
> <r....@rug.nl> wrote:
>   
>> Hello,
>>
>> I have a question regarding searching (in) the jcr:data property.
>>
>> We store the contents of our documents in the jcr:content/jcr:data
>> property. We also have added many custom properties to the jcr:content
>> node, like creator, modifier, storageStatus and paths.
>>
>> In most search-cases, we want to search the jcr:data contents only. It
>> now seems all properties are indexed by Lucene, and when we search we
>> find files which have the keywords in other properties than jcr:data.
>> While we do need to be able to search those properties in certain cases,
>> we also want to be able to search in 'contents only', hence the jcr:data
>> property. Can this be done, and if so, how? We use the xpath search
>> expression, and even though I've seen the SQL use jcr:data (I believe) as
>> field to search on, I can't seem to do this with the xpath expression.
>>
>> Example of the used xpath expression:
>>     
>
> First of all, I really doubt whether you want to use jcr:like. It is
> really not scaling at all, let alone searching in binaries. Why aren't
> you using jcr:contains?
>
> Furthermore, searching in a single property is as simple as defining
> which property to search in the jcr:contains:
>
> thus:
>
> jcr:contains(.,'foo') is node scope level (.)
> jcr:contains(jcr:data, 'foo') search in jcr:data property
>
> Regards Ard
>
>
>   
>> /jcr:root/webplatform/www.rug.nl//element(*,
>> nt:file)/jcr:content[jcr:like(@cms:virtualPathLC, '/corporate/%') and
>> @cms:type and not(@cms:type='link') and not(@cms:type='folder') and
>> not(@cms:type='function') and not(@cms:type='metadata') and
>> jcr:contains(., 'zernike')]/(rep:excerpt()|@cms:type) order by
>> jcr:score() descending
>>
>> Any help on this matter would be appreciated.
>>
>> Kind regards,
>>
>> Robbert Uittenbroek
>>
>>
>>     


-- 
Robbert M. Uittenbroek
Webdeveloper
 
Rijksuniversiteit Groningen
Donald Smits Centrum voor Informatie Technologie
Applicatieontwikkeling

Zernikeborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9298
http://www.rug.nl/cit
--


Re: Finetuning (JCR) Search

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello Robbert,

On Tue, Feb 2, 2010 at 9:24 AM, Robbert Uittenbroek
<r....@rug.nl> wrote:
> Hello,
>
> I have a question regarding searching (in) the jcr:data property.
>
> We store the contents of our documents in the jcr:content/jcr:data
> property. We also have added many custom properties to the jcr:content
> node, like creator, modifier, storageStatus and paths.
>
> In most search-cases, we want to search the jcr:data contents only. It
> now seems all properties are indexed by Lucene, and when we search we
> find files which have the keywords in other properties than jcr:data.
> While we do need to be able to search those properties in certain cases,
> we also want to be able to search in 'contents only', hence the jcr:data
> property. Can this be done, and if so, how? We use the xpath search
> expression, and even though I've seen the SQL use jcr:data (I believe) as
> field to search on, I can't seem to do this with the xpath expression.
>
> Example of the used xpath expression:

First of all, I really doubt whether you want to use jcr:like. It does
not scale at all, let alone when searching in binaries. Why aren't
you using jcr:contains?

Furthermore, searching in a single property is as simple as specifying
which property to search in jcr:contains:

thus:

jcr:contains(.,'foo') is node scope level (.)
jcr:contains(jcr:data, 'foo') search in jcr:data property
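
A small sketch of how the two forms could be assembled as XPath predicate
strings. The class and method names are made up for this example, and the
@jcr:data attribute-axis spelling for the property scope is an assumption
based on how JCR XPath normally addresses properties. The helper only builds
the text; the full statement would then be passed to
QueryManager.createQuery(statement, Query.XPATH).

```java
final class ContainsPredicateSketch {

    /** Wraps a value in single quotes, doubling embedded quotes as XPath string literals require. */
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds jcr:contains(scope, 'term'); scope is "." for the node or "@name" for one property. */
    static String contains(String scope, String term) {
        return "jcr:contains(" + scope + ", " + literal(term) + ")";
    }

    public static void main(String[] args) {
        System.out.println(contains(".", "foo"));          // node scope
        System.out.println(contains("@jcr:data", "foo"));  // single-property scope (assumed spelling)
    }
}
```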

Regards Ard


>
> /jcr:root/webplatform/www.rug.nl//element(*,
> nt:file)/jcr:content[jcr:like(@cms:virtualPathLC, '/corporate/%') and
> @cms:type and not(@cms:type='link') and not(@cms:type='folder') and
> not(@cms:type='function') and not(@cms:type='metadata') and
> jcr:contains(., 'zernike')]/(rep:excerpt()|@cms:type) order by
> jcr:score() descending
>
> Any help on this matter would be appreciated.
>
> Kind regards,
>
> Robbert Uittenbroek
>
>

Finetuning (JCR) Search

Posted by Robbert Uittenbroek <r....@rug.nl>.
Hello,

I have a question regarding searching (in) the jcr:data property.

We store the contents of our documents in the jcr:content/jcr:data
property. We also have added many custom properties to the jcr:content
node, like creator, modifier, storageStatus and paths.

In most search cases, we want to search the jcr:data contents only. It
now seems that all properties are indexed by Lucene, and when we search we
find files that have the keywords in properties other than jcr:data.
While we do need to be able to search those properties in certain cases,
we also want to be able to search in 'contents only', hence the jcr:data
property. Can this be done, and if so, how? We use an XPath search
expression, and even though I've seen SQL use jcr:data (I believe) as the
field to search on, I can't seem to do this with the XPath expression.

Example of the used xpath expression:

/jcr:root/webplatform/www.rug.nl//element(*,
nt:file)/jcr:content[jcr:like(@cms:virtualPathLC, '/corporate/%') and 
@cms:type and not(@cms:type='link') and not(@cms:type='folder') and
not(@cms:type='function') and not(@cms:type='metadata') and
jcr:contains(., 'zernike')]/(rep:excerpt()|@cms:type) order by
jcr:score() descending

Any help on this matter would be appreciated.

Kind regards,

Robbert Uittenbroek