Posted to solr-user@lucene.apache.org by Tim Gilbert <TI...@morningstar.com> on 2011/04/14 15:12:01 UTC

Fast DIH with 1:M multValue entities

We are importing a large number of records into Solr using DIH.  We have
one schema with ~2000 fields declared, which map to several database
schemas, so a typical document has ~500 fields in use.  We are importing
about 2 million "rows", and in testing the import takes under 20 minutes
across 14 different entities, which really map to one virtual document.
Then we added our multiValued fields and, well, it didn't work out nearly
as well. :-)

 

We have several fields which are 1:M and so in our data-config.xml we
might have something like this:

 

<document name="allfund">
  <entity name="FundId" dataSource="getFundManager"
          query="{call dbo.getFundManager_Id()}">
    <field column="FundId" name="HS04C" />
    <entity name="FundData" dataSource="getFundManager"
            query="{call dbo.getFundManager_Data(${FundId.FundId})}">
      <field column="ManagerName" name="OF015" />
    </entity>
  </entity>
</document>

 

The inner entity runs one database query per fund, which is a lot of
queries for a small result set, and it is really slowing things down for
us.

 

My question is more to ask advice, so it's a multi-parter :-)

 

1) Is there a way to declare an in-memory lookup in DIH, so that we can
fetch the entire Many side in one database query and match it up on the
PK?  Then we can declare that field multiValued.

2) Assuming that isn't currently available, I thought about
"denormalizing" the 1:M into a delimited list and then using
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
to tokenize.  That would allow us to search on individual bits, and we
could build something into the front-end to handle the display.  It
means we wouldn't use multiValued and we'd have to modify our db, and
we'd lose out on some of multiValued's abilities.

3) The third option is to open up DIH and try to add the first feature
to it ourselves.
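
For illustration only, a variant of option 2 that keeps the field
multiValued might look roughly like the sketch below.  It uses DIH's
RegexTransformer, whose splitBy attribute splits a delimited column
into multiple values.  The stored procedure
dbo.getFundManager_Denormalized() and its ManagerNames column are
hypothetical names, assumed to return one row per fund with the manager
names joined by ';'.  OF015 is assumed to be declared
multiValued="true" in schema.xml.

```xml
<document name="allfund">
  <!-- Hypothetical procedure: one row per fund, with FundId and a
       ';'-delimited ManagerNames column (not part of the original setup). -->
  <entity name="Fund" dataSource="getFundManager"
          transformer="RegexTransformer"
          query="{call dbo.getFundManager_Denormalized()}">
    <field column="FundId" name="HS04C" />
    <!-- splitBy is a regex; each piece becomes one value of OF015. -->
    <field column="OF015" sourceColName="ManagerNames" splitBy=";" />
  </entity>
</document>
```

This issues a single query for the whole import, at the cost of doing
the denormalization on the database side.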

 

Am I approaching this the right way?  Are there other ways I haven't
considered or don't know about?

 

Thanks in advance,

 

Tim


RE: Fast DIH with 1:M multValue entities

Posted by Ephraim Ofir <Ep...@icq.com>.
Search the list for my post "DIH - deleting documents, high performance
(delta) imports, and passing parameters", which shows a different
approach to 1:M sub-entities.

Ephraim Ofir

-----Original Message-----
From: Tim Gilbert [mailto:TIM.GILBERT@morningstar.com] 
Sent: Thursday, April 14, 2011 6:02 PM
To: solr-user@lucene.apache.org
Subject: RE: Fast DIH with 1:M multValue entities

How did I miss that?  Thanks, I will try that, as it seems to be the
"in-memory" lookup solution I needed.

Thanks Erick,

Tim

RE: Fast DIH with 1:M multValue entities

Posted by Tim Gilbert <TI...@morningstar.com>.
How did I miss that?  Thanks, I will try that, as it seems to be the
"in-memory" lookup solution I needed.

Thanks Erick,

Tim

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Thursday, April 14, 2011 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Fast DIH with 1:M multValue entities

I'm not sure this applies, but have you looked at
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Best
Erick

Re: Fast DIH with 1:M multValue entities

Posted by Erick Erickson <er...@gmail.com>.
I'm not sure this applies, but have you looked at
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Best
Erick
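
For reference, a minimal sketch of how CachedSqlEntityProcessor might
be applied to the data-config.xml from the original post.  It assumes a
hypothetical stored procedure dbo.getFundManager_AllData(), not
mentioned anywhere in the thread, that returns every (FundId,
ManagerName) row in a single call.  The where attribute matches the
cached rows to the parent entity on FundId, and OF015 is assumed to be
declared multiValued="true" in schema.xml.

```xml
<document name="allfund">
  <entity name="FundId" dataSource="getFundManager"
          query="{call dbo.getFundManager_Id()}">
    <field column="FundId" name="HS04C" />
    <!-- The query runs once; rows are cached in memory and looked up
         per parent row using the where clause (childColumn=parent.column).
         dbo.getFundManager_AllData() is a hypothetical procedure. -->
    <entity name="FundData" dataSource="getFundManager"
            processor="CachedSqlEntityProcessor"
            query="{call dbo.getFundManager_AllData()}"
            where="FundId=FundId.FundId">
      <field column="ManagerName" name="OF015" />
    </entity>
  </entity>
</document>
```

The trade-off is memory: the whole Many side is held in the cache, so
the JVM heap has to be large enough for it.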
