You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by "O. Olson" <ol...@yahoo.it> on 2013/05/16 18:01:22 UTC

Speed up import of Hierarchical Data

I am using the DataImportHandler to Query a SQL Server and populate Solr.
Unfortunately, SQL does not have an understanding of hierarchical
relationships, and hence I use Table Joins. The following is an outline of
my table structure: 


PROD_TABLE
-> SKU (Primary Key)
-> Title  (varchar)
-> Descr (varchar)

CAT_TABLE
-> SKU (Foreign Key)
->  CategoryLevel (int i.e. 1, 2, 3 …)
-> CategoryName  (varchar)

I specify the SQL Query in the db-data-config.xml file – a snippet of which
looks like: 

<dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://localhost\...."/>
    <document>
        <entity name="Product" 
				query="SELECT SKU, Title, Descr FROM PROD_TABLE">
            <field column="SKU" name="SKU" />
			<field column="Title" name="Title" />
            <field column="Descr" name="Descr" />

			<entity name="Cat1"  
                    query="SELECT CategoryName from CAT_TABLE where
SKU='${Product.SKU}' AND CategoryLevel=1">
				<field column="CategoryName" name="Category1" /> 
			</entity>
			<entity name="Cat2"  
                    query="SELECT CategoryName from CAT_TABLE where
SKU='${Product.SKU}' AND CategoryLevel=2">
				<field column="CategoryName" name="Category2" /> 
			</entity>
			<entity name="Cat3"  
                    query="SELECT CategoryName from CAT_TABLE where
SKU='${Product.SKU}' AND CategoryLevel=3">
				<field column="CategoryName" name="Category3" /> 
			</entity>
			
        </entity>
    </document>
</dataConfig>

It seems like the DataImportHandler handler sends out three or four queries
for each Product. This results in a very slow import. Is there any way to
speed this up? I would not mind an intermediate step of first extracting SQL
and then putting it into Solr.

Thank you for all your help. 
O. O.




--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924.html
Sent from the Solr - User mailing list archive at Nabble.com.

Hierarchical Data

Posted by "varsha.yadav" <va...@orkash.com>.
Hi,

I have been following http://wiki.apache.org/solr/HierarchicalFaceting
I have hierarchical data for facet . Some documents also have multiple 
hierarchy. like :
Doc#1 London > UK > 51.5
Doc#2 UK >54.0
Doc#3 Indiana > United States > 40.0, London >UK>51.5
Doc#4 United States > 39.7, Washington > United States > 38.8

what can be optimal schema for indexing this data so that i get 
following result by solr query :
1) i want to retrieve hierarchical data count by facet pivot query . ex: 
facet.pivot=country,state
2) I want Lat values wrt every document in query output.ex: Doc#3 
40.0,51.5 . Doc#2 54.0
3) I get direct search query like country:"United states" . state 
:"Washington"

  I think through this i am able to express my requirement along with 
data .
Please tell me how can i put data index and retreive through query .
I check out solution which you provided me about 
PathHierarchyTokenizerFactory. But along with hierarachy i have to put 
data with name State,district,lat,lon etc. So that i can also access 
direct query on fields.

Thanks
Varsha

Re: Speed up import of Hierarchical Data

Posted by "O. Olson" <ol...@yahoo.it>.
Thank you Stefan. I am new to Solr and I would need to read up more on
CachedSqlEntityProcessor. Do you have any clue where to begin? There do not
seem to be any tutorials online.

The link you provided seems to have a very short and unclear explanation.
After “Example 1” you have “The usage is exactly same as the other one.”
What does “other one” refer to? I did not understand the description
completely.

This description seems to say that if the query is the same as a prior query
it would fetched from the cache. From my case each of the Category queries
are unique because they have a unique SKU and Category Level. Would
CachedSqlEntityProcessor then help me?

Thank you,
O. O.



Stefan Matheis-2 wrote
> That sounds like a perfect match for
> http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor :)





--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064034.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Speed up import of Hierarchical Data

Posted by Stefan Matheis <ma...@gmail.com>.
That sounds like a perfect match for http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor :)

On Thursday, May 16, 2013 at 6:01 PM, O. Olson wrote:

> I am using the DataImportHandler to Query a SQL Server and populate Solr.
> Unfortunately, SQL does not have an understanding of hierarchical
> relationships, and hence I use Table Joins. The following is an outline of
> my table structure:  
>  
>  
> PROD_TABLE
> -> SKU (Primary Key)
> -> Title (varchar)
> -> Descr (varchar)
>  
> CAT_TABLE
> -> SKU (Foreign Key)
> -> CategoryLevel (int i.e. 1, 2, 3 …)
> -> CategoryName (varchar)
>  
> I specify the SQL Query in the db-data-config.xml file – a snippet of which
> looks like:  
>  
> <dataConfig>
> <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url="jdbc:sqlserver://localhost\...."/>
> <document>
> <entity name="Product"  
> query="SELECT SKU, Title, Descr FROM PROD_TABLE">
> <field column="SKU" name="SKU" />
> <field column="Title" name="Title" />
> <field column="Descr" name="Descr" />
>  
> <entity name="Cat1"  
> query="SELECT CategoryName from CAT_TABLE where
> SKU='${Product.SKU}' AND CategoryLevel=1">
> <field column="CategoryName" name="Category1" />  
> </entity>
> <entity name="Cat2"  
> query="SELECT CategoryName from CAT_TABLE where
> SKU='${Product.SKU}' AND CategoryLevel=2">
> <field column="CategoryName" name="Category2" />  
> </entity>
> <entity name="Cat3"  
> query="SELECT CategoryName from CAT_TABLE where
> SKU='${Product.SKU}' AND CategoryLevel=3">
> <field column="CategoryName" name="Category3" />  
> </entity>
>  
> </entity>
> </document>
> </dataConfig>
>  
> It seems like the DataImportHandler handler sends out three or four queries
> for each Product. This results in a very slow import. Is there any way to
> speed this up? I would not mind an intermediate step of first extracting SQL
> and then putting it into Solr.
>  
> Thank you for all your help.  
> O. O.
>  
>  
>  
>  
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924.html
> Sent from the Solr - User mailing list archive at Nabble.com (http://Nabble.com).
>  
>  



RE: Speed up import of Hierarchical Data

Posted by "O. Olson" <ol...@yahoo.it>.
Just an update for others reading this thread: I had some
CachedSqlEntityProcessor and had it addressed in the thread How do I use
CachedSqlEntityProcessor?
(http://lucene.472066.n3.nabble.com/How-do-I-use-CachedSqlEntityProcessor-td4064919.html)

I basically had to declare the child entities in the db-data-config.xml
like: 

<entity name="Cat1"  
                    query="SELECT CategoryName, SKU from CAT_TABLE WHERE
CategoryLevel=1" cacheKey="SKU" cacheLookup="Product.SKU"
processor="CachedSqlEntityProcessor">
                                <field column="CategoryName"
name="Category1" /> 
                        </entity>

Thanks to James and others for their help.
O. O.




--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4065400.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Speed up import of Hierarchical Data

Posted by "O. Olson" <ol...@yahoo.it>.
Thank you James. I think I got this to work using CachedSqlEntityProcessor –
and it seems extremely fast. I will try SortedMapBackedCache on Monday :-). 
Thank you,
O. O.



Dyer, James-2 wrote
> Using SqlEntityProcessor with cacheImpl="SortedMapBackedCache" is the same
> as specifying "CachedSqlEntityProcessor".  Because the pluggable caches
> are only partially committed, I never added details to the wiki, so it
> still refers to CachedSEP.  But its the same thing.
> 
> What is new here, though, is that you don't have to use
> "SortedMapBackedCache" (this is an in-memory cache and can only scale to
> what fits in heap.)  You can use an alternate cache (but none are included
> in the Solr distribution).  Also, you can cache data this doesn't come
> from SQL.  So its more flexible this way rather than the older CachedSEP.
> 
> Here's the wiki link with an example: 
> http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor 
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311





--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064297.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Speed up import of Hierarchical Data

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Using SqlEntityProcessor with cacheImpl="SortedMapBackedCache" is the same as specifying "CachedSqlEntityProcessor".  Because the pluggable caches are only partially committed, I never added details to the wiki, so it still refers to CachedSEP.  But its the same thing.

What is new here, though, is that you don't have to use "SortedMapBackedCache" (this is an in-memory cache and can only scale to what fits in heap.)  You can use an alternate cache (but none are included in the Solr distribution).  Also, you can cache data this doesn't come from SQL.  So its more flexible this way rather than the older CachedSEP.

Here's the wiki link with an example:  http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor 

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: O. Olson [mailto:olson_ord@yahoo.it] 
Sent: Thursday, May 16, 2013 5:06 PM
To: solr-user@lucene.apache.org
Subject: RE: Speed up import of Hierarchical Data

Thank you James. Are there any examples of SortedMapBackedCache? I am new to
Solr and I do not find many tutorials in this regard. I just modified the
examples and they worked for me.  What is a good way to learn these basics?
O. O.



Dyer, James-2 wrote
> See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up 2
> DIH handlers.  The first would query the "CAT_TABLE" and save it to a
> disk-backed cache, using DIHCacheWriter.  You then would replace your 3
> child entities in the 2nd DIH handler to use DIHCacheProcessor to read
> back the cached data.  This is a little complicated to do, but it would
> let you just cache the data once and because it is disk-backed, will scale
> to whatever size the CAT_TABLE is.  (For some details, see this thread:
> http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)
> 
> A simpler method is simply to specify "cacheImpl=SortedMapBackedCache" on
> the 3 child entities.  (This is the same as using
> CachedSqlEntityProcessor.)  It would generate 3 in-memory caches, each
> with the same data.  If CAT_TABLE is small, this would be adequate.  
> 
> In between this would be to create a disk-backed cache Impl (or use the
> ones at SOLR-2613 or SOLR-2948) and specify it on "cacheImpl".  It would
> still create 3 identical caches, but they would be disk-backed and could
> scale beyond what in-memory can handle.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311





--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064040.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Speed up import of Hierarchical Data

Posted by "O. Olson" <ol...@yahoo.it>.
Thank you James. Are there any examples of SortedMapBackedCache? I am new to
Solr and I do not find many tutorials in this regard. I just modified the
examples and they worked for me.  What is a good way to learn these basics?
O. O.



Dyer, James-2 wrote
> See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up 2
> DIH handlers.  The first would query the "CAT_TABLE" and save it to a
> disk-backed cache, using DIHCacheWriter.  You then would replace your 3
> child entities in the 2nd DIH handler to use DIHCacheProcessor to read
> back the cached data.  This is a little complicated to do, but it would
> let you just cache the data once and because it is disk-backed, will scale
> to whatever size the CAT_TABLE is.  (For some details, see this thread:
> http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)
> 
> A simpler method is simply to specify "cacheImpl=SortedMapBackedCache" on
> the 3 child entities.  (This is the same as using
> CachedSqlEntityProcessor.)  It would generate 3 in-memory caches, each
> with the same data.  If CAT_TABLE is small, this would be adequate.  
> 
> In between this would be to create a disk-backed cache Impl (or use the
> ones at SOLR-2613 or SOLR-2948) and specify it on "cacheImpl".  It would
> still create 3 identical caches, but they would be disk-backed and could
> scale beyond what in-memory can handle.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311





--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064040.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Speed up import of Hierarchical Data

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up 2 DIH handlers.  The first would query the "CAT_TABLE" and save it to a disk-backed cache, using DIHCacheWriter.  You then would replace your 3 child entities in the 2nd DIH handler to use DIHCacheProcessor to read back the cached data.  This is a little complicated to do, but it would let you just cache the data once and because it is disk-backed, will scale to whatever size the CAT_TABLE is.  (For some details, see this thread: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)

A simpler method is simply to specify "cacheImpl=SortedMapBackedCache" on the 3 child entities.  (This is the same as using CachedSqlEntityProcessor.)  It would generate 3 in-memory caches, each with the same data.  If CAT_TABLE is small, this would be adequate.  

In between this would be to create a disk-backed cache Impl (or use the ones at SOLR-2613 or SOLR-2948) and specify it on "cacheImpl".  It would still create 3 identical caches, but they would be disk-backed and could scale beyond what in-memory can handle.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: O. Olson [mailto:olson_ord@yahoo.it] 
Sent: Thursday, May 16, 2013 11:01 AM
To: solr-user@lucene.apache.org
Subject: Speed up import of Hierarchical Data

I am using the DataImportHandler to Query a SQL Server and populate Solr.
Unfortunately, SQL does not have an understanding of hierarchical
relationships, and hence I use Table Joins. The following is an outline of
my table structure: 


PROD_TABLE
-> SKU (Primary Key)
-> Title  (varchar)
-> Descr (varchar)

CAT_TABLE
-> SKU (Foreign Key)
->  CategoryLevel (int i.e. 1, 2, 3 …)
-> CategoryName  (varchar)

I specify the SQL Query in the db-data-config.xml file – a snippet of which
looks like: 

<dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://localhost\...."/>
    <document>
        <entity name="Product" 
				query="SELECT SKU, Title, Descr FROM PROD_TABLE">
            <field column="SKU" name="SKU" />
			<field column="Title" name="Title" />
            <field column="Descr" name="Descr" />

			<entity name="Cat1"  
                    query="SELECT CategoryName from CAT_TABLE where
SKU='${Product.SKU}' AND CategoryLevel=1">
				<field column="CategoryName" name="Category1" /> 
			</entity>
			<entity name="Cat2"  
                    query="SELECT CategoryName from CAT_TABLE where
SKU='${Product.SKU}' AND CategoryLevel=2">
				<field column="CategoryName" name="Category2" /> 
			</entity>
			<entity name="Cat3"  
                    query="SELECT CategoryName from CAT_TABLE where
SKU='${Product.SKU}' AND CategoryLevel=3">
				<field column="CategoryName" name="Category3" /> 
			</entity>
			
        </entity>
    </document>
</dataConfig>

It seems like the DataImportHandler handler sends out three or four queries
for each Product. This results in a very slow import. Is there any way to
speed this up? I would not mind an intermediate step of first extracting SQL
and then putting it into Solr.

Thank you for all your help. 
O. O.




--
View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924.html
Sent from the Solr - User mailing list archive at Nabble.com.