You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Maria Vazquez <ma...@dexone.com> on 2011/09/30 20:28:31 UTC

Multithreaded JdbcDataStore and CachedSqlEntityProcessor

Hi,
I¹m using threads with JdbcDataStore and CachedSqlEntityProcessor.
I noticed that if I make it single threaded CachedSqlEntityProcessor behaves
as expected (it only queries the db once and caches all the rows). If I make
it multi threaded it seems to make multiple db queries and when I debug the
source code it looks like it is not thread safe.
Any ideas?
Thanks,
Maria


Re: Multithreaded JdbcDataStore and CachedSqlEntityProcessor

Posted by Maria Vazquez <ma...@dexone.com>.
I’ll try to test the patch this week.
Thanks!


On 10/10/11 2:01 AM, "Mikhail Khludnev" <mk...@griddynamics.com> wrote:

> Hello,
> 
> Pls have a look to attached patch for
> http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_4_0/
> 
> It makes all tests green, but I have a several doubts in it.
> 
> Key points:
> 
> # EntityProcessorBase has context and  dataSourceRowCache fields which can't
> be shared among threads (per thread ThreadedEntityProcessorWrappers share
> single instance of EntityProcessor).
> * I removed context field, and provide it via Context.CURRENT threadlocal.
> * here is the really bad thing EntityProcessorBase.init() has a special case
> for "white box" tests, which bypass putting context into threadlocal. i.e.
> during full circuit run init() argument is ignored, but for tests it's pushing
> into threadlocal. I don't know how to improve it.  
> * I removed dataSourceRowCache and getFromRowCacheTransformed() by inlining
> it, because I see no purpose in it - it's iteration state. As a drawback
> consuming CachedSqlEntProc empties it, and after some time of thinking I've
> got that I'm wrong here. CachedSqlEntProc should be reused during debug
> sessions. I'm going to return dataSourceRowCache back, but put into
> threadlocal context too. Nevertheless, right now cacheWithWhereClause is used
> to distribute locks by PK.
> * synchronized CachedSqlEntityProcessor.init(), because it calls thread
> sensitive cacheInit()
> 
> # DocBuilder 
> * creating EntityProcessor in EntityRunner.<init> was wrong because it leads
> to creating CachedSqlEntProc per thread as Maria has spotted. I created per
> entity processors map in DocBuilder. It allows to share single
> CachedSqlEntProc among threads.
> * I implemented stacking Context which is passed through thread local.
> (push/pop&oldCtx in runAThread())
> * I was wonder that there is the second code path in additiond to
> EntityRunners: buildDocument() - stacking contexts was already done there, but
> I had to reorder init() and push rows;
> * and the third code path collectDelta() - here I had to add stacking context
> too. 
> 
> # plenty changes in tests, there two kinds of it
> * clearing context in threadlocal  before and after "white-box" tests. (in
> whiteboxing I mean unit tests that avoid "official" DocBuilder entry points)
> * and in TestCachedSqlEntityProcessor you can see that I actually broke
> Caching nature of this processor, required for debug most.
> 
> # Summary:
> * by applying this patch you'll have CachedSqlEntProc works in multiple
> threads, but for every debug request it will pull all data every time.
> * I'm going to fix this issue and provide non-invasive fix by putting
> dataSourceRowCache into a context.entitySession.
> 
> # Questions to DIH committers
> * isn't ContextImpl.putVal(String, Object, Map) wrong (it always put into
> entitySession)
> * is there any plan to refactor DIH into single codepath on EntityRunners (and
> remove separate flows for debug and delta);
> 
> --
> Mikhail
> 
> On Sat, Oct 8, 2011 at 1:47 AM, Mikhail Khludnev <mk...@griddynamics.com>
> wrote:
>> Hello,
>> 
>> First of all I've got the same feeling for some time. 
>> I've checked TestThreaded
>> <http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_4_0/solr/contr
>> ib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestThreaded
>> .java>  and got that it doesn't cover CachedSqlEntityProcessor with fancy
>> where="xid=x.id <http://x.id> " feature.
>> Pls find the test for it attached:
>> dih-mt-cached-race-condition-TestThreaded.java.patch (as a patch for
>> TestThreaded from tags/lucene_solr_3_4_0) 
>> 
>> Most times testCachedMultiThread_FullImport() fails (but really rare it
>> passed) and testCachedSingleThread_FullImport() is always green. The
>> difference is one and two threads. 
>> Pls also find attached:
>> 
>> dih-mt-cached-race-condition-TestThreaded.log contains FINEST messages
>> including the root cause (my guess):
>> 2011/10/08 0:01:13
>> org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
>> 詳細レベル (低): arow : null
>> 
>> (I'm sorry for hieroglyphs, I don't know where they comes from, it should be
>> just FINEST)
>> and corrupted documents, where some of child entity fields are missed.
>> 
>> dih-mt-cached-race-condition-TestThreaded.test.out - is the actual test
>> failure (I'm sorry but it was another run than in .log). The failed assert is
>> the searching root entities by the children terms.
>> 
>> I'm not sure but creating children entity runners in multiple threads looks
>> really suspicious for me, I hadn't have a time to look deeper into the code.
>> 
>> PS. I want to contribute two bleeding ideas (and implementations too), which
>> improves DIH much, but this race condition is a blocker for them.
>> 
>> Regards
>> 
>> On Mon, Oct 3, 2011 at 10:09 PM, Maria Vazquez <ma...@dexone.com>
>> wrote:
>>> When I'm debugging, if it is single threaded
>>> CachedSqlEntityProcessor.getAllNonCachedRows is called only once, all the
>>> rows cached and next time it requests a row it gets it from the cached data.
>>> In the logs I see the SQL call only once.
>>> 
>>> If I use multiple threads, it calls
>>> CachedSqlEntityProcessor.getAllNonCachedRows multiple times, so it creates a
>>> new cache per thread.
>>> 
>>> The cache should be shared among threads and be safe, right?
>>> 
>>> Thanks!
>>> Maria
>>> 
>>> 
>>> 
>>> 
>>> On 10/1/11 8:00 PM, "pulkitsinghal@gmail.com" <pu...@gmail.com>
>>> wrote:
>>> 
>>>> > What part of the source code in debug mode behaved in a fashion such as
>>>> to
>>>> > make it seem like it is not thread-safe?
>>>> >
>>>> > If it feels difficult to put into words then you can always make a small
>>>> 5 min
>>>> > screencast to demo the issue and talk about it. I do that for really
>>>> complex
>>>> > stuff with Jing by techsmith (free version).
>>>> >
>>>> > Or you can put up a small and simplified test case together to demo the
>>>> issue
>>>> > and then paste the link for that hosted attachment :)
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> > On Sep 30, 2011, at 1:28 PM, Maria Vazquez <ma...@dexone.com>
>>>> wrote:
>>>> >
>>>>> >> Hi,
>>>>> >> I¹m using threads with JdbcDataStore and CachedSqlEntityProcessor.
>>>>> >> I noticed that if I make it single threaded CachedSqlEntityProcessor
>>>>> behaves
>>>>> >> as expected (it only queries the db once and caches all the rows). If I
>>>>> make
>>>>> >> it multi threaded it seems to make multiple db queries and when I debug
>>>>> the
>>>>> >> source code it looks like it is not thread safe.
>>>>> >> Any ideas?
>>>>> >> Thanks,
>>>>> >> Maria
>>>>> >>
>>> 
>> 
>> 


Re: Multithreaded JdbcDataStore and CachedSqlEntityProcessor

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

Pls have a look to attached patch for
http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_4_0/

It makes all tests green, but I have a several doubts in it.

Key points:

# EntityProcessorBase has context and  dataSourceRowCache fields which can't
be shared among threads (per thread ThreadedEntityProcessorWrappers share
single instance of EntityProcessor).
* I removed context field, and provide it via Context.CURRENT threadlocal.
* here is the really bad thing EntityProcessorBase.init() has a special case
for "white box" tests, which bypass putting context into threadlocal. i.e.
during full circuit run init() argument is ignored, but for tests it's
pushing into threadlocal. I don't know how to improve it.
* I removed dataSourceRowCache and getFromRowCacheTransformed() by inlining
it, because I see no purpose in it - it's iteration state. As a drawback
consuming CachedSqlEntProc empties it, and after some time of thinking I've
got that I'm wrong here. CachedSqlEntProc should be reused during debug
sessions. I'm going to return dataSourceRowCache back, but put into
threadlocal context too. Nevertheless, right now cacheWithWhereClause is
used to distribute locks by PK.
* synchronized CachedSqlEntityProcessor.init(), because it calls thread
sensitive cacheInit()

# DocBuilder
* creating EntityProcessor in EntityRunner.<init> was wrong because it leads
to creating CachedSqlEntProc per thread as Maria has spotted. I created per
entity processors map in DocBuilder. It allows to share single
CachedSqlEntProc among threads.
* I implemented stacking Context which is passed through thread local.
(push/pop&oldCtx in runAThread())
* I was wonder that there is the second code path in additiond to
EntityRunners: buildDocument() - stacking contexts was already done there,
but I had to reorder init() and push rows;
* and the third code path collectDelta() - here I had to add stacking
context too.

# plenty changes in tests, there two kinds of it
* clearing context in threadlocal before and after "white-box" tests. (in
whiteboxing I mean unit tests that avoid "official" DocBuilder entry points)
* and in TestCachedSqlEntityProcessor you can see that I actually broke
Caching nature of this processor, required for debug most.

# Summary:
* by applying this patch you'll have CachedSqlEntProc works in multiple
threads, but for every debug request it will pull all data every time.
* I'm going to fix this issue and provide non-invasive fix by putting
dataSourceRowCache into a context.entitySession.

# Questions to DIH committers
* isn't ContextImpl.putVal(String, Object, Map) wrong (it always put into
entitySession)
* is there any plan to refactor DIH into single codepath on EntityRunners
(and remove separate flows for debug and delta);

--
Mikhail

On Sat, Oct 8, 2011 at 1:47 AM, Mikhail Khludnev <mkhludnev@griddynamics.com
> wrote:

> Hello,
>
> First of all I've got the same feeling for some time.
> I've checked TestThreaded<http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_4_0/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestThreaded.java> and
> got that it doesn't cover CachedSqlEntityProcessor with fancy where="xid=
> x.id" feature.
> Pls find the test for it attached:
> dih-mt-cached-race-condition-TestThreaded.java.patch (as a patch for
> TestThreaded from tags/lucene_solr_3_4_0)
>
> Most times testCached*MultiThread*_FullImport() fails (but really rare it
> passed) and testCached*SingleThread*_FullImport() is always green. The
> difference is one and two threads.
> Pls also find attached:
>
> dih-mt-cached-race-condition-TestThreaded.log contains FINEST messages
> including the root cause (my guess):
> 2011/10/08 0:01:13 org.apache.solr.handler.dataimport.*ThreadedEntityProcessorWrapper
> nextRow*
> 詳細レベル (低): *arow : null*
> *
> *
> (I'm sorry for hieroglyphs, I don't know where they comes from, it should
> be just FINEST)
> and corrupted documents, where some of child entity fields are missed.
>
> dih-mt-cached-race-condition-TestThreaded.test.out - is the actual test
> failure (I'm sorry but it was another run than in .log). The failed assert
> is the searching root entities by the children terms.
>
> I'm not sure but creating children entity runners in multiple threads looks
> really suspicious for me, I hadn't have a time to look deeper into the code.
>
> PS. I want to contribute two bleeding ideas (and implementations too),
> which improves DIH much, but this race condition is a blocker for them.
>
> Regards
>
> On Mon, Oct 3, 2011 at 10:09 PM, Maria Vazquez <ma...@dexone.com>wrote:
>
>> When I'm debugging, if it is single threaded
>> CachedSqlEntityProcessor.getAllNonCachedRows is called only once, all the
>> rows cached and next time it requests a row it gets it from the cached
>> data.
>> In the logs I see the SQL call only once.
>>
>> If I use multiple threads, it calls
>> CachedSqlEntityProcessor.getAllNonCachedRows multiple times, so it creates
>> a
>> new cache per thread.
>>
>> The cache should be shared among threads and be safe, right?
>>
>> Thanks!
>> Maria
>>
>>
>>
>>
>> On 10/1/11 8:00 PM, "pulkitsinghal@gmail.com" <pu...@gmail.com>
>> wrote:
>>
>> > What part of the source code in debug mode behaved in a fashion such as
>> to
>> > make it seem like it is not thread-safe?
>> >
>> > If it feels difficult to put into words then you can always make a small
>> 5 min
>> > screencast to demo the issue and talk about it. I do that for really
>> complex
>> > stuff with Jing by techsmith (free version).
>> >
>> > Or you can put up a small and simplified test case together to demo the
>> issue
>> > and then paste the link for that hosted attachment :)
>> >
>> > Sent from my iPhone
>> >
>> > On Sep 30, 2011, at 1:28 PM, Maria Vazquez <ma...@dexone.com>
>> wrote:
>> >
>> >> Hi,
>> >> I¹m using threads with JdbcDataStore and CachedSqlEntityProcessor.
>> >> I noticed that if I make it single threaded CachedSqlEntityProcessor
>> behaves
>> >> as expected (it only queries the db once and caches all the rows). If I
>> make
>> >> it multi threaded it seems to make multiple db queries and when I debug
>> the
>> >> source code it looks like it is not thread safe.
>> >> Any ideas?
>> >> Thanks,
>> >> Maria
>> >>
>>
>>
>
>
> --
> Sincerely yours
> Mikhail (Mike) Khludnev
> Developer
> Grid Dynamics
> tel. 1-415-738-8644
> Skype: mkhludnev
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>



-- 
Sincerely yours
Mikhail (Mike) Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev
<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Multithreaded JdbcDataStore and CachedSqlEntityProcessor

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

First of all I've got the same feeling for some time.
I've checked TestThreaded<http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_4_0/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestThreaded.java>
and
got that it doesn't cover CachedSqlEntityProcessor with fancy where="xid=
x.id" feature.
Pls find the test for it attached:
dih-mt-cached-race-condition-TestThreaded.java.patch (as a patch for
TestThreaded from tags/lucene_solr_3_4_0)

Most times testCached*MultiThread*_FullImport() fails (but really rare it
passed) and testCached*SingleThread*_FullImport() is always green. The
difference is one and two threads.
Pls also find attached:

dih-mt-cached-race-condition-TestThreaded.log contains FINEST messages
including the root cause (my guess):
2011/10/08 0:01:13
org.apache.solr.handler.dataimport.*ThreadedEntityProcessorWrapper
nextRow*
詳細レベル (低): *arow : null*
*
*
(I'm sorry for hieroglyphs, I don't know where they comes from, it should be
just FINEST)
and corrupted documents, where some of child entity fields are missed.

dih-mt-cached-race-condition-TestThreaded.test.out - is the actual test
failure (I'm sorry but it was another run than in .log). The failed assert
is the searching root entities by the children terms.

I'm not sure but creating children entity runners in multiple threads looks
really suspicious for me, I hadn't have a time to look deeper into the code.

PS. I want to contribute two bleeding ideas (and implementations too), which
improves DIH much, but this race condition is a blocker for them.

Regards

On Mon, Oct 3, 2011 at 10:09 PM, Maria Vazquez <ma...@dexone.com>wrote:

> When I'm debugging, if it is single threaded
> CachedSqlEntityProcessor.getAllNonCachedRows is called only once, all the
> rows cached and next time it requests a row it gets it from the cached
> data.
> In the logs I see the SQL call only once.
>
> If I use multiple threads, it calls
> CachedSqlEntityProcessor.getAllNonCachedRows multiple times, so it creates
> a
> new cache per thread.
>
> The cache should be shared among threads and be safe, right?
>
> Thanks!
> Maria
>
>
>
>
> On 10/1/11 8:00 PM, "pulkitsinghal@gmail.com" <pu...@gmail.com>
> wrote:
>
> > What part of the source code in debug mode behaved in a fashion such as
> to
> > make it seem like it is not thread-safe?
> >
> > If it feels difficult to put into words then you can always make a small
> 5 min
> > screencast to demo the issue and talk about it. I do that for really
> complex
> > stuff with Jing by techsmith (free version).
> >
> > Or you can put up a small and simplified test case together to demo the
> issue
> > and then paste the link for that hosted attachment :)
> >
> > Sent from my iPhone
> >
> > On Sep 30, 2011, at 1:28 PM, Maria Vazquez <ma...@dexone.com>
> wrote:
> >
> >> Hi,
> >> I¹m using threads with JdbcDataStore and CachedSqlEntityProcessor.
> >> I noticed that if I make it single threaded CachedSqlEntityProcessor
> behaves
> >> as expected (it only queries the db once and caches all the rows). If I
> make
> >> it multi threaded it seems to make multiple db queries and when I debug
> the
> >> source code it looks like it is not thread safe.
> >> Any ideas?
> >> Thanks,
> >> Maria
> >>
>
>


-- 
Sincerely yours
Mikhail (Mike) Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev
<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Multithreaded JdbcDataStore and CachedSqlEntityProcessor

Posted by Maria Vazquez <ma...@dexone.com>.
When I'm debugging, if it is single threaded
CachedSqlEntityProcessor.getAllNonCachedRows is called only once, all the
rows cached and next time it requests a row it gets it from the cached data.
In the logs I see the SQL call only once.

If I use multiple threads, it calls
CachedSqlEntityProcessor.getAllNonCachedRows multiple times, so it creates a
new cache per thread.

The cache should be shared among threads and be safe, right?

Thanks!
Maria




On 10/1/11 8:00 PM, "pulkitsinghal@gmail.com" <pu...@gmail.com>
wrote:

> What part of the source code in debug mode behaved in a fashion such as to
> make it seem like it is not thread-safe?
> 
> If it feels difficult to put into words then you can always make a small 5 min
> screencast to demo the issue and talk about it. I do that for really complex
> stuff with Jing by techsmith (free version).
> 
> Or you can put up a small and simplified test case together to demo the issue
> and then paste the link for that hosted attachment :)
> 
> Sent from my iPhone
> 
> On Sep 30, 2011, at 1:28 PM, Maria Vazquez <ma...@dexone.com> wrote:
> 
>> Hi,
>> I¹m using threads with JdbcDataStore and CachedSqlEntityProcessor.
>> I noticed that if I make it single threaded CachedSqlEntityProcessor behaves
>> as expected (it only queries the db once and caches all the rows). If I make
>> it multi threaded it seems to make multiple db queries and when I debug the
>> source code it looks like it is not thread safe.
>> Any ideas?
>> Thanks,
>> Maria
>> 


Re: Multithreaded JdbcDataStore and CachedSqlEntityProcessor

Posted by pu...@gmail.com.
What part of the source code in debug mode behaved in a fashion such as to make it seem like it is not thread-safe?

If it feels difficult to put into words then you can always make a small 5 min screencast to demo the issue and talk about it. I do that for really complex stuff with Jing by techsmith (free version).

Or you can put up a small and simplified test case together to demo the issue and then paste the link for that hosted attachment :)

Sent from my iPhone

On Sep 30, 2011, at 1:28 PM, Maria Vazquez <ma...@dexone.com> wrote:

> Hi,
> I’m using threads with JdbcDataStore and CachedSqlEntityProcessor.
> I noticed that if I make it single threaded CachedSqlEntityProcessor behaves
> as expected (it only queries the db once and caches all the rows). If I make
> it multi threaded it seems to make multiple db queries and when I debug the
> source code it looks like it is not thread safe.
> Any ideas?
> Thanks,
> Maria
>