You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Amit Nithian <an...@gmail.com> on 2008/11/25 09:22:02 UTC

CachedSqlEntityProcessor's purpose

I am starting to look at Solr's Data Import Handler framework and am quite
impressed with it so far. My question is in trying to reduce the number of
SQL queries issued to the database and saw this entity processor.

In the following example:
<entity name="x" query="select * from x">
    <entity name="y" query="select * from y where xid=${x.id}"
processor="CachedSqlEntityProcessor">
    </entity>
<entity>

I like the concept of having multiple entity blocks for clarity but why
wouldn't I have (for DB efficiency), the following as one entity's SQL
statement "select * from X,Y where x.id=y.xid" and have two fields pointing
at X and Y columns?  My main question though is how the
CachedSQLEntityProcessor helps in this case for I want to use the multiple
entity blocks for cleanliness. If I have 500,000 X records, how many SQL
queries in the second entity block (y) would get executed, 500000?

If there is any more detailed information about the number of queries
executed in different circumstances, memory overhead or way that the data is
brought from the database into Java  it would be much appreciated for it's
important for my application.

Thanks in advance!
Amit

Re: CachedSqlEntityProcessor's purpose

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian <an...@gmail.com> wrote:

>
> I like the concept of having multiple entity blocks for clarity but why
> wouldn't I have (for DB efficiency), the following as one entity's SQL
> statement "select * from X,Y where x.id=y.xid" and have two fields
> pointing
> at X and Y columns?

You can certainly do that. However, it is a problem when you need field X or
Y to be multi-valued. You'd get repeated rows for that query and
DataImportHandler will have no way to figure out what to put where. In the
nested entities approach, DataImportHandler multiple values will come from a
nested entity which can be very easily represented as a List. If you do not
have multi-valued fields then you can go for that approach.

> My main question though is how the
> CachedSQLEntityProcessor helps in this case for I want to use the multiple
> entity blocks for cleanliness. If I have 500,000 X records, how many SQL
> queries in the second entity block (y) would get executed, 500000?

For each row fetched from the parent entity, the query for its nested entity
is executed after replacing the variables with known values. When the nested
entity has few records in the database, it is more efficient to use
CachedSqlEntityProcessor which executes the query only once and keeps all
the returned rows in memory. After that for each row returned by parent
entity, the cached entity needs to do a lookup in the cache which is quite
fast. Since all rows are stored in-memory, you trade memory for number of
queries to the db when you use CachedSqlEntityProcessor.

http://wiki.apache.org/solr/DataImportHandler#head-4465e39677ec06e4b14fd6a574434bac6e4d01e1

-- 
Regards,
Shalin Shekhar Mangar.

Re: CachedSqlEntityProcessor's purpose

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

I am raising an issue for better error checking in CachedSqlEntityprocessor
https://issues.apache.org/jira/browse/SOLR-884

On Wed, Nov 26, 2008 at 9:34 PM, Steffen B. <s....@fhtw-berlin.de> wrote:
>
>
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>> I suspect only one thing
>> are the data types same for productid and product.id
>> in the db?
That should not be the case . The names do not matter only the values do.
I am running out of clues
If you are good at java it might do some help to hook it up on a debugger.
I guess you are using a trunk build.
The place where you may find something useful information is
EntityProcessorBase#getIdCacheData()


>>
>
> Good point. I just checked and they are both int(11). Could the problem be
> caused by the different column names, id in the products table and productid
> in the data-table?
>
>
>
>>> Yes, I wouldn't call the cache PK-attribute "where", as there can be
>>> another
>>> "WHERE" inside the query, too, and because it isn't actually an SQL-WHERE
>>> that's executed with that attribute value, right? Maybe "cacheid" or
>>> "cachekey" would be clearer? And an additional check for whether the
>>> chosen
>>> cachekey is in the result-set could minimize problems, because right now
>>> omitting the primary key column inside the query doesn't yield any
>>> errors.
>> true. that check would be helpful
>>
>> as to the configuration, any changes can break back compat
>>
>
> Then why not support both? "where" and "cachekey"? =)
> Thanks for your help!
> Steffen
> --
> View this message in context: http://www.nabble.com/CachedSqlEntityProcessor%27s-purpose-tp20676874p20703855.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul

Re: CachedSqlEntityProcessor's purpose

Posted by "Steffen B." <s....@fhtw-berlin.de>.


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> I suspect only one thing
> are the data types same for productid and product.id
> in the db?
> 

Good point. I just checked and they are both int(11). Could the problem be
caused by the different column names, id in the products table and productid
in the data-table?



>> Yes, I wouldn't call the cache PK-attribute "where", as there can be
>> another
>> "WHERE" inside the query, too, and because it isn't actually an SQL-WHERE
>> that's executed with that attribute value, right? Maybe "cacheid" or
>> "cachekey" would be clearer? And an additional check for whether the
>> chosen
>> cachekey is in the result-set could minimize problems, because right now
>> omitting the primary key column inside the query doesn't yield any
>> errors.
> true. that check would be helpful
> 
> as to the configuration, any changes can break back compat
> 

Then why not support both? "where" and "cachekey"? =)
Thanks for your help!
Steffen
-- 
View this message in context: http://www.nabble.com/CachedSqlEntityProcessor%27s-purpose-tp20676874p20703855.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: CachedSqlEntityProcessor's purpose

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

I suspect only one thing
are the data types same for productid and product.id
in the db?



On Wed, Nov 26, 2008 at 5:38 PM, Steffen B. <s....@fhtw-berlin.de> wrote:
>
> Hi Noble Paul,
> thanks for your quick response.
>
>
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>> What i expect to happen is when you run the query
>> SELECT manufacturer,id FROM product_data WHERE type='manu'
>>
>> you must get all the rows from the DB in the table product_data
>>
>
> Unfortunately, that's not the case. Taken from a real project, verbose mode:
> (names replaced)
> # run 1 - no cache processor
> Config:
> <entity name="myEntity" dataSource="my-db" pk="productid"
>                query="SELECT data FROM data2productid WHERE
> productid='${product.id}' AND isdeleted=0" />
>
> Result:
> <lst name="entity:myEntity">
> <str name="query">
> SELECT data FROM data2productid WHERE productid='6' AND isdeleted=0
> </str>
> <str name="time-taken">0:0:0.25</str>
> <str>----------- row #1-------------</str>
> <str name="data">SM224950</str>
> <str>---------------------------------------------</str>
> <str>----------- row #2-------------</str>
> <str name="data">04138D01</str>
> <str>---------------------------------------------</str>
> </lst>
>
> # run 2 - cache processor
> Config:
> <entity name="myEntity" dataSource="my-db" pk="productid"
>                query="SELECT data,productid FROM data2productid WHERE
> isdeleted=0" where="productid=product.id"
> processor="CachedSqlEntityProcessor" />
>
> Result:
> <lst name="entity:myEntity">
> <str name="query">SELECT data,productid  FROM data2productid WHERE
> isdeleted=0</str>
> <str name="time-taken">0:0:0.249</str>
> </lst>
>
> It seems like the processor is unable to associate the cached rows with the
> matching documents and therefore won't add any cached entity data to my
> documents. Or did I make any mistakes? :(
>
>
>
>> So it is likely that you get an OOM if the DB table has a lot of rows.
>>
>
> Yes, that at least indicates, that the rows are loaded into memory. So far
> it seems to work for me.
>
>
>
>> processor="CachedSQLEntityProcessor" is wrong
>> processor="CachedSqlEntityProcessor" is right
>>
>
> Uh, little typo. My bad.
>
>
>
>> counter intuitive? which part is counter-intuitive ? the where part?
>>
>> The cache store the entire data with id value as the key
>>
>
> Yes, I wouldn't call the cache PK-attribute "where", as there can be another
> "WHERE" inside the query, too, and because it isn't actually an SQL-WHERE
> that's executed with that attribute value, right? Maybe "cacheid" or
> "cachekey" would be clearer? And an additional check for whether the chosen
> cachekey is in the result-set could minimize problems, because right now
> omitting the primary key column inside the query doesn't yield any errors.
true. that check would be helpful

as to the configuration, any changes can break back compat
> --
> View this message in context: http://www.nabble.com/CachedSqlEntityProcessor%27s-purpose-tp20676874p20699494.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul

Re: CachedSqlEntityProcessor's purpose

Posted by "Steffen B." <s....@fhtw-berlin.de>.

Hi Noble Paul,
thanks for your quick response.


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> What i expect to happen is when you run the query
> SELECT manufacturer,id FROM product_data WHERE type='manu'
> 
> you must get all the rows from the DB in the table product_data
> 

Unfortunately, that's not the case. Taken from a real project, verbose mode:
(names replaced)
# run 1 - no cache processor
Config:
<entity name="myEntity" dataSource="my-db" pk="productid"
            	query="SELECT data FROM data2productid WHERE
productid='${product.id}' AND isdeleted=0" />

Result:
<lst name="entity:myEntity">
<str name="query">
SELECT data FROM data2productid WHERE productid='6' AND isdeleted=0
</str>
<str name="time-taken">0:0:0.25</str>
<str>----------- row #1-------------</str>
<str name="data">SM224950</str>
<str>---------------------------------------------</str>
<str>----------- row #2-------------</str>
<str name="data">04138D01</str>
<str>---------------------------------------------</str>
</lst>

# run 2 - cache processor
Config:
<entity name="myEntity" dataSource="my-db" pk="productid"
            	query="SELECT data,productid FROM data2productid WHERE
isdeleted=0" where="productid=product.id"
processor="CachedSqlEntityProcessor" />

Result: 
<lst name="entity:myEntity">
<str name="query">SELECT data,productid  FROM data2productid WHERE
isdeleted=0</str>
<str name="time-taken">0:0:0.249</str>
</lst>

It seems like the processor is unable to associate the cached rows with the
matching documents and therefore won't add any cached entity data to my
documents. Or did I make any mistakes? :(



> So it is likely that you get an OOM if the DB table has a lot of rows.
> 

Yes, that at least indicates, that the rows are loaded into memory. So far
it seems to work for me.



> processor="CachedSQLEntityProcessor" is wrong
> processor="CachedSqlEntityProcessor" is right
> 

Uh, little typo. My bad.



> counter intuitive? which part is counter-intuitive ? the where part?
> 
> The cache store the entire data with id value as the key
> 

Yes, I wouldn't call the cache PK-attribute "where", as there can be another
"WHERE" inside the query, too, and because it isn't actually an SQL-WHERE
that's executed with that attribute value, right? Maybe "cacheid" or
"cachekey" would be clearer? And an additional check for whether the chosen
cachekey is in the result-set could minimize problems, because right now
omitting the primary key column inside the query doesn't yield any errors.
-- 
View this message in context: http://www.nabble.com/CachedSqlEntityProcessor%27s-purpose-tp20676874p20699494.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: CachedSqlEntityProcessor's purpose

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

What i expect to happen is when you run the query
SELECT manufacturer,id FROM product_data WHERE type='manu'

you must get all the rows from the DB in the table product_data

So it is likely that you get an OOM if the DB table has a lot of rows.

processor="CachedSQLEntityProcessor" is wrong
processor="CachedSqlEntityProcessor" is right

counter intuitive? which part is counter-intuitive ? the where part?






The cache store the entire data with id value as the key

On Wed, Nov 26, 2008 at 4:27 PM, Steffen B. <s....@fhtw-berlin.de> wrote:
>
> I've been discussing this topic with Amit and also have some questions...
>
>
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>> On Tue, Nov 25, 2008 at 11:35 PM, Amit Nithian <an...@gmail.com> wrote:
>>> 2) In the example, there were two use cases, one that is like
>>> query="select
>>> * from Y where xid=${X.ID}" and another where it's query="select * from
>>> Y"
>>> where="xid=${x.ID}. Is there any difference in how
>>> CachedSQLEntityPRocessor
>>> behaves? Does it know to strip off the WHERE clause and simply cache the
>>> "select * from Y"?
>> It fetches all the rows using the 'query' first.
>>
>> he where="xid=x.id" (see no ${} here )
>> is evaluated in the map. In the map all the xid values will be kept as
>> keys and the lookup is done on the map after evaluating the value of
>> 'x.id' as ${x.ID}
>>
>
> If I'm not mistaken, this behaviour would require that the query I use
> returns the column I want to match on, too. Without the
> CachedSQLEntityProcessor, my entity for fetching multi-valued manufacturers
> could look like this: ("product" is the parent entity)
> <entity query="SELECT manufacturer FROM product_data WHERE id=${product.id}
> AND type='manu'" />
> Now if I wanted to use the cache, what would the query look like...? I tried
> this:
> <entity query="SELECT manufacturer,id FROM product_data WHERE type='manu'"
> where="id=product.id" processor="CachedSQLEntityProcessor" />
> That seems to be a pretty counter-intuitive change, which isn't explained at
> all in the Wiki. =| Also, it didn't work for me, the query that is executed
> returns zero rows, when I debug it with the dataimport.jsp-tool. The cache
> seems to get filled though, as I encountered some JVM out-of-memory errors
> when trying to cache huge tables... ;)
>
> --
> View this message in context: http://www.nabble.com/CachedSqlEntityProcessor%27s-purpose-tp20676874p20698724.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul

Re: CachedSqlEntityProcessor's purpose

Posted by "Steffen B." <s....@fhtw-berlin.de>.

I've been discussing this topic with Amit and also have some questions...

Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> On Tue, Nov 25, 2008 at 11:35 PM, Amit Nithian <an...@gmail.com> wrote:
>> 2) In the example, there were two use cases, one that is like
>> query="select
>> * from Y where xid=${X.ID}" and another where it's query="select * from
>> Y"
>> where="xid=${x.ID}. Is there any difference in how
>> CachedSQLEntityPRocessor
>> behaves? Does it know to strip off the WHERE clause and simply cache the
>> "select * from Y"?
> It fetches all the rows using the 'query' first.
> 
> he where="xid=x.id" (see no ${} here )
> is evaluated in the map. In the map all the xid values will be kept as
> keys and the lookup is done on the map after evaluating the value of
> 'x.id' as ${x.ID}
> 

If I'm not mistaken, this behaviour would require that the query I use
returns the column I want to match on, too. Without the
CachedSQLEntityProcessor, my entity for fetching multi-valued manufacturers
could look like this: ("product" is the parent entity)
<entity query="SELECT manufacturer FROM product_data WHERE id=${product.id}
AND type='manu'" />
Now if I wanted to use the cache, what would the query look like...? I tried
this:
<entity query="SELECT manufacturer,id FROM product_data WHERE type='manu'"
where="id=product.id" processor="CachedSQLEntityProcessor" />
That seems to be a pretty counter-intuitive change, which isn't explained at
all in the Wiki. =| Also, it didn't work for me, the query that is executed
returns zero rows, when I debug it with the dataimport.jsp-tool. The cache
seems to get filled though, as I encountered some JVM out-of-memory errors
when trying to cache huge tables... ;)

-- 
View this message in context: http://www.nabble.com/CachedSqlEntityProcessor%27s-purpose-tp20676874p20698724.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: CachedSqlEntityProcessor's purpose

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

On Tue, Nov 25, 2008 at 11:35 PM, Amit Nithian <an...@gmail.com> wrote:
> Thanks for the responses. Few follow-ups:
> 1) It seems that the CachedSQLEntityProcessor performs the where clause in
> memory on the cache. Is this cache an in memory RDBMS or maps?
It is a hashmap in memory
> 2) In the example, there were two use cases, one that is like query="select
> * from Y where xid=${X.ID}" and another where it's query="select * from Y"
> where="xid=${x.ID}. Is there any difference in how CachedSQLEntityPRocessor
> behaves? Does it know to strip off the WHERE clause and simply cache the
> "select * from Y"?
It fetches all the rows using the 'query' first.

he where="xid=x.id" (see no ${} here )
is evaluated in the map. In the map all the xid values will be kept as
keys and the lookup is done on the map after evaluating the value of
'x.id' as ${x.ID}


Then for subsequent requests it looks
>
> What are some dataset sizes that have been tested using this framework and
> what are some performance metrics?
>
> Thanks again
> Amit
>
> On Tue, Nov 25, 2008 at 7:32 AM, Noble Paul നോബിള്‍ नोब्ळ् <
> noble.paul@gmail.com> wrote:
>
>> every row emitted by an outer entity results in a new Sql query in the
>> inner entity. (yes 500000 queries on inner entity)So,if you wish to
>> join multiple tables then nested entities is the way to go.
>>
>> CachedSqlEntityProcessor is meant to help you reduce the number of
>> queries fired on sub-entities.
>>
>> If you get the entire table in one query (by using select * from y)
>> and use a separate where attribute , The entire set of rows in y get
>> loaded into RAM.
>>
>> If you use it w/o the where attribute, it still ends up loading the
>> entire table into the memory (it is an unbounded cache ).It can easily
>> give you an OOM.
>>
>> dod not use CachedSqlEntityProcessor for tidying up. use it if you
>> wish to save time and you have a lot of RAM
>>
>>
>> On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian <an...@gmail.com> wrote:
>> > I am starting to look at Solr's Data Import Handler framework and am
>> quite
>> > impressed with it so far. My question is in trying to reduce the number
>> of
>> > SQL queries issued to the database and saw this entity processor.
>> >
>> > In the following example:
>> > <entity name="x" query="select * from x">
>> >    <entity name="y" query="select * from y where xid=${x.id}"
>> > processor="CachedSqlEntityProcessor">
>> >    </entity>
>> > <entity>
>> >
>> > I like the concept of having multiple entity blocks for clarity but why
>> > wouldn't I have (for DB efficiency), the following as one entity's SQL
>> > statement "select * from X,Y where x.id=y.xid" and have two fields
>> pointing
>> > at X and Y columns?  My main question though is how the
>> > CachedSQLEntityProcessor helps in this case for I want to use the
>> multiple
>> > entity blocks for cleanliness. If I have 500,000 X records, how many SQL
>> > queries in the second entity block (y) would get executed, 500000?
>> >
>> > If there is any more detailed information about the number of queries
>> > executed in different circumstances, memory overhead or way that the data
>> is
>> > brought from the database into Java  it would be much appreciated for
>> it's
>> > important for my application.
>> >
>> > Thanks in advance!
>> > Amit
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>



-- 
--Noble Paul

Re: CachedSqlEntityProcessor's purpose

Posted by Amit Nithian <an...@gmail.com>.

Thanks for the responses. Few follow-ups:
1) It seems that the CachedSQLEntityProcessor performs the where clause in
memory on the cache. Is this cache an in memory RDBMS or maps?
2) In the example, there were two use cases, one that is like query="select
* from Y where xid=${X.ID}" and another where it's query="select * from Y"
where="xid=${x.ID}. Is there any difference in how CachedSQLEntityPRocessor
behaves? Does it know to strip off the WHERE clause and simply cache the
"select * from Y"?

What are some dataset sizes that have been tested using this framework and
what are some performance metrics?

Thanks again
Amit

On Tue, Nov 25, 2008 at 7:32 AM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.paul@gmail.com> wrote:

> every row emitted by an outer entity results in a new Sql query in the
> inner entity. (yes 500000 queries on inner entity)So,if you wish to
> join multiple tables then nested entities is the way to go.
>
> CachedSqlEntityProcessor is meant to help you reduce the number of
> queries fired on sub-entities.
>
> If you get the entire table in one query (by using select * from y)
> and use a separate where attribute , The entire set of rows in y get
> loaded into RAM.
>
> If you use it w/o the where attribute, it still ends up loading the
> entire table into the memory (it is an unbounded cache ).It can easily
> give you an OOM.
>
> dod not use CachedSqlEntityProcessor for tidying up. use it if you
> wish to save time and you have a lot of RAM
>
>
> On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian <an...@gmail.com> wrote:
> > I am starting to look at Solr's Data Import Handler framework and am
> quite
> > impressed with it so far. My question is in trying to reduce the number
> of
> > SQL queries issued to the database and saw this entity processor.
> >
> > In the following example:
> > <entity name="x" query="select * from x">
> >    <entity name="y" query="select * from y where xid=${x.id}"
> > processor="CachedSqlEntityProcessor">
> >    </entity>
> > <entity>
> >
> > I like the concept of having multiple entity blocks for clarity but why
> > wouldn't I have (for DB efficiency), the following as one entity's SQL
> > statement "select * from X,Y where x.id=y.xid" and have two fields
> pointing
> > at X and Y columns?  My main question though is how the
> > CachedSQLEntityProcessor helps in this case for I want to use the
> multiple
> > entity blocks for cleanliness. If I have 500,000 X records, how many SQL
> > queries in the second entity block (y) would get executed, 500000?
> >
> > If there is any more detailed information about the number of queries
> > executed in different circumstances, memory overhead or way that the data
> is
> > brought from the database into Java  it would be much appreciated for
> it's
> > important for my application.
> >
> > Thanks in advance!
> > Amit
> >
>
>
>
> --
> --Noble Paul
>

Re: CachedSqlEntityProcessor's purpose

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

every row emitted by an outer entity results in a new Sql query in the
inner entity. (yes 500000 queries on inner entity)So,if you wish to
join multiple tables then nested entities is the way to go.

CachedSqlEntityProcessor is meant to help you reduce the number of
queries fired on sub-entities.

If you get the entire table in one query (by using select * from y)
and use a separate where attribute , The entire set of rows in y get
loaded into RAM.

If you use it w/o the where attribute, it still ends up loading the
entire table into the memory (it is an unbounded cache ).It can easily
give you an OOM.

dod not use CachedSqlEntityProcessor for tidying up. use it if you
wish to save time and you have a lot of RAM

On Tue, Nov 25, 2008 at 1:52 PM, Amit Nithian <an...@gmail.com> wrote:
> I am starting to look at Solr's Data Import Handler framework and am quite
> impressed with it so far. My question is in trying to reduce the number of
> SQL queries issued to the database and saw this entity processor.
>
> In the following example:
> <entity name="x" query="select * from x">
>    <entity name="y" query="select * from y where xid=${x.id}"
> processor="CachedSqlEntityProcessor">
>    </entity>
> <entity>
>
> I like the concept of having multiple entity blocks for clarity but why
> wouldn't I have (for DB efficiency), the following as one entity's SQL
> statement "select * from X,Y where x.id=y.xid" and have two fields pointing
> at X and Y columns?  My main question though is how the
> CachedSQLEntityProcessor helps in this case for I want to use the multiple
> entity blocks for cleanliness. If I have 500,000 X records, how many SQL
> queries in the second entity block (y) would get executed, 500000?
>
> If there is any more detailed information about the number of queries
> executed in different circumstances, memory overhead or way that the data is
> brought from the database into Java  it would be much appreciated for it's
> important for my application.
>
> Thanks in advance!
> Amit
>

-- 
--Noble Paul