Posted to solr-user@lucene.apache.org by Todd Long <lo...@gmail.com> on 2015/11/13 20:54:55 UTC

DIH Caching w/ BerkleyBackedCache

We currently index using DIH along with the SortedMapBackedCache cache
implementation, which had worked well until recently, when we needed to
index a much larger table. We were running into memory issues with
SortedMapBackedCache, so we tried switching to the BerkleyBackedCache, but
we appear to have some configuration issues. I've included our basic setup
below. The issue we're running into is that the Berkeley database appears
to be evicting database files (see message below) before the import has
completed. When I watch the cache directory I only ever see two database
files at a time, each ~1GB in size (this size appears to be hard-coded). Is
there some additional configuration I'm missing to prevent the process from
"cleaning" up database files before the index has finished? I think this
"cleanup" keeps kicking off the caching again, so it never completes...
without caching, the indexing takes ~2 hours. Any help would be greatly
appreciated. Thanks.

Cleaning message: "Chose lowest utilized file for cleaning. fileChosen: 0x0
..."

<dataConfig>
  <dataSource type"JdbcDataSource" ... />

  <document>
    <entity name="parent"
               query="select ID, tp.* from TABLE_PARENT tp">

      <entity name="child"
                 query="select ID, NAME, VALUE from TABLE_CHILD"
                
cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
                 cacheKey="ID"
                 cacheLookup="parent.ID"
                 persistCacheName="CHILD"
                 persistCacheBaseDir="/some/cache/dir"
                 persistCacheFieldNames="ID,NAME,VALUE"
                 persistCacheFieldTypes="STRING,STRING,STRING"
                 berkleyInternalCacheSize="1000000"
                 berkleyInternalShared="true" />

    </entity>
  </document>
</dataConfig>




RE: DIH Caching w/ BerkleyBackedCache

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Todd,

I have no idea if this will perform acceptably with so many values in one field.  I doubt the solr/patch code was really optimized for such a use case.  In my production environment, I have je-6.2.31.jar on the classpath.  I don't think I've tried it with other versions.

James Dyer
Ingram Content Group



RE: DIH Caching w/ BerkleyBackedCache

Posted by Todd Long <lo...@gmail.com>.
James,

I apologize for the late response.


Dyer, James-2 wrote
> With the DIH request, are you specifying "cacheDeletePriorData=false"

We are not specifying that property (it looks like it defaults to "false").
I'm actually seeing this issue when running a full clean/import.

It appears that the Berkeley DB "cleaner" is always removing the oldest file
once there are three. In this case, I'll see two 1GB files, and then as the
third file is being written (after ~200MB) the oldest 1GB file will fall off
(i.e. get deleted). I'm only utilizing ~13% of the disk space at the time.
I'm using Berkeley DB version 4.1.6 with Solr 4.8.1. I'm not specifying any
configuration properties other than what I mentioned before. I simply
cannot figure out what is going on with the "cleaner" logic that would deem
that file "lowest utilized". Is there any other Berkeley DB/system
configuration I could consider that would affect this?
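
For reference, the knobs that seem to govern this are standard JE
environment properties rather than anything in the DIH patch. A minimal
standalone sketch of how one might override them (assuming je-4.1.x on the
classpath; the class name, directory, and values here are just placeholders
to experiment with):

  import java.io.File;
  import com.sleepycat.je.Environment;
  import com.sleepycat.je.EnvironmentConfig;

  public class JeCleanerProbe {
      public static void main(String[] args) {
          EnvironmentConfig cfg = new EnvironmentConfig();
          cfg.setAllowCreate(true);
          // The cleaner reclaims log files whose live-data ratio falls
          // below this percentage (JE default: 50).
          cfg.setConfigParam("je.cleaner.minUtilization", "50");
          // Rename cleaned files to .del instead of deleting them, which
          // makes it visible what the cleaner actually reclaimed.
          cfg.setConfigParam("je.cleaner.expunge", "false");
          // Maximum size of each .jdb log file, in bytes (~1GB matches
          // what BerkleyBackedCache appears to hard-code).
          cfg.setConfigParam("je.log.fileMax", "1000000000");
          Environment env = new Environment(new File("/some/cache/dir"), cfg);
          env.close();
      }
  }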

It's possible that this caching simply might not be suitable for our data
set where one document might contain a field with tens of thousands of
values... maybe this is the bottleneck with using this database as every add
copies in the prior data and then the "cleaner" removes the old stuff. Maybe
it's working like it should but just incredibly slow... I can get a full
index without caching in about two hours, however, when using this caching
it was still running after 24 hours (still caching the sub-entity).

Thanks again for the reply.

Respectfully,
Todd




RE: DIH Caching w/ BerkleyBackedCache

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Todd,

With the DIH request, are you specifying "cacheDeletePriorData=false"?  Looking at the BerkleyBackedCache code, if this is set to true, it deletes the cache and assumes the current update will fully repopulate it.  If you want to do an incremental update to the cache, it needs to be false.  You might also need to specify "clean=false", but I'm not sure if this is a requirement.
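
For example, something along these lines (the host and core name here are hypothetical, and I'm assuming the cache parameter is passed on the request URL just like the standard DIH parameters):

  http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&cacheDeletePriorData=false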

I've used DIH with BerkleyBackedCache for a few years and it works well for us.  But rather than using it inline, we have a number of DIH handlers that just build caches; then, when they're all built, a final DIH handler joins data from the caches and indexes it to solr.  We also run several handlers at once, as you are doing, each handling part of the data.
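
Roughly sketched below, with a caveat: DIHCacheWriter and DIHCacheProcessor come from the uncommitted SOLR-2943 patch, so treat the class names, parameters, and handler names here as assumptions to verify against whatever patch you apply, not as a definitive recipe.

  Phase 1, one handler per cached entity (build a persistent cache, index
  nothing):

    http://localhost:8983/solr/mycore/dataimport-child?command=full-import&writerImpl=org.apache.solr.handler.dataimport.DIHCacheWriter

  Phase 2, the final handler's child entity reads the prebuilt cache while
  joining to the parent:

    <entity name="child"
            processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
            cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
            persistCacheName="CHILD"
            persistCacheBaseDir="/some/cache/dir"
            cacheKey="ID"
            cacheLookup="parent.ID" />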

But I have to warn you that this code hasn't been maintained by anyone.  I'm using an older DIH jar (4.6) with a newer solr.  I think there might have been an API change or something that prevented the uncommitted caching code from working with newer versions, but I honestly forget.  This is probably a viable solution if you don't want to write any code, but it might take some trial and error to get it working.

James Dyer
Ingram Content Group




Re: DIH Caching w/ BerkleyBackedCache

Posted by Todd Long <lo...@gmail.com>.
Mikhail Khludnev wrote
> It's worth mentioning that for a really complex relation scheme it might be
> challenging to organize all of the relations into parallel ordered streams.

This will most likely be the issue for us, which is why I would like to have
the Berkley cache solution to fall back on, if possible. Again, I'm not sure
why, but it appears that the Berkley cache is overwriting itself (i.e.
cleaning up unused data) while building the database... I've read plenty of
other threads where folks appear to be having success using that caching
solution.


Mikhail Khludnev wrote
> threads... you said? Which ones? Declarative parallelization in
> EntityProcessor worked only with certain 3.x versions.

We are running multiple DIH instances which query against specific
partitions of the data (i.e. a mod of the document id we're indexing).
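
For example (hypothetical; four handlers, each with its own data-config
that differs only in the mod value):

  <entity name="parent"
          query="select ID, tp.* from TABLE_PARENT tp where mod(tp.ID, 4) = 0">
    ...
  </entity>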




Re: DIH Caching w/ BerkleyBackedCache

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Mon, Nov 16, 2015 at 5:08 PM, Todd Long <lo...@gmail.com> wrote:

> Mikhail Khludnev wrote
> > "External merge" join helps to avoid boilerplate caching in such simple
> > cases.
>
> Thank you for the reply. I can certainly look into this though I would have
> to apply the patch for our version (i.e. 4.8.1). I really just simplified
> our data configuration here which actually consists of many sub-entities
> that are successfully using the SortedMapBackedCache cache. I imagine this
> would still apply to those as the queries themselves are simple for the
> most part.

It's worth mentioning that for a really complex relation scheme it might be
challenging to organize all of the relations into parallel ordered streams.


> I assume performance-wise this would only require a single table
> scan?
>
It sounds like that, but I'm not expert enough to comment in precise terms.


>
> I'm still very much interested in resolving this Berkley database cache
> issue. I'm sure there is some minor configuration I'm missing that is
> causing this behavior. Again, I've had no issues with the
> SortedMapBackedCache for its caching purpose... I've tried simplifying our
> data configuration to only one thread with a single sub-entity with the
> same
> results. Again, any help would be greatly appreciated with this.
>

threads... you said? Which ones? Declarative parallelization in
EntityProcessor worked only with certain 3.x versions.






-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: DIH Caching w/ BerkleyBackedCache

Posted by Todd Long <lo...@gmail.com>.
Mikhail Khludnev wrote
> "External merge" join helps to avoid boilerplate caching in such simple
> cases.

Thank you for the reply. I can certainly look into this though I would have
to apply the patch for our version (i.e. 4.8.1). I really just simplified
our data configuration here which actually consists of many sub-entities
that are successfully using the SortedMapBackedCache cache. I imagine this
would still apply to those as the queries themselves are simple for the most
part. I assume performance-wise this would only require a single table
scan?

I'm still very much interested in resolving this Berkley database cache
issue. I'm sure there is some minor configuration I'm missing that is
causing this behavior. Again, I've had no issues with the
SortedMapBackedCache for its caching purpose... I've tried simplifying our
data configuration to only one thread with a single sub-entity, with the
same results. Any help would be greatly appreciated.




Re: DIH Caching w/ BerkleyBackedCache

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Todd,

"External merge" join helps to avoid boilerplate caching in such simple
cases.

It should look something like this:

  <entity name="parent"
               query="select ID, tp.* from TABLE_PARENT tp ORDER  BY ID">

      <entity name="child"
                 query="select ID, NAME, VALUE from TABLE_CHILD ORDER  BY
ID"

                 cacheKey="ID"
                 cacheLookup="parent.ID"
                 join="zipper" />

    </entity>
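
The reason for the ORDER BY on both queries: zipper performs a single
forward merge over the two ordered result sets, reading each table once and
holding nothing in a cache, which is why both streams must be sorted by the
join key in the same order.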





-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>