Posted to solr-user@lucene.apache.org by mroosendaal <mr...@yahoo.com> on 2012/11/09 07:39:10 UTC

RE: DIH nested entities don't work

Hi James,

What I did:
* built a jar from the patch
* downloaded the BDB library
* added them to my classpath
* downloaded a nightly 4.1 Solr build
* created a db config according to:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestEphemeralCache.java

Although I got things working, after 2 hours of indexing I stopped the
process. For that amount of data, Endeca took 1h15m. After looking at some
of the tests in the patch, I configured the data-config.xml as follows:
<document>
	<entity name="END_FRG_PRODUCTS_VW"
		processor="SqlEntityProcessor"
		persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
		persistCacheName="END_FRG_PRODUCTS_VW"
		persistCachePartitionNumber="0"
		persistCacheBaseDir="d:\cacheloc"
		berkleyInternalCacheSize="1000000"
		berkleyInternalShared="true"
		query="select PDT_ID, SEARCH_TITLE from END_FRG_PRODUCTS_VW">
		<entity name="END_FRG_FEATURES_VW"
			processor="SqlEntityProcessor"
			persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
			persistCacheName="FEATURE"
			cacheKey="PDT_ID"
			cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"
			berkleyInternalCacheSize="1000000"
			berkleyInternalShared="true"
			persistCacheBaseDir="d:\cacheloc"
			query="select * from END_FRG_FEATURES_VW"/>
	</entity>
</document>

The behaviour was different this time [snapshot from the indexing after 8
minutes: Requests: 2899, Fetched: 28974398, Skipped: 0, Processed: 2258],
but it was still slow, and the parameter 'persistCacheBaseDir' had no
effect. The difference from the previous run is that that run issued only 2
requests and hadn't processed anything after 2 hours.

Hope you can help me.

Thanks,
Maarten




--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4019223.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by mroosendaal <mr...@yahoo.com>.
Thanks, I'll give that a try tomorrow. Here's the data-config.xml which I
will try when I get to work tomorrow:

<dataConfig>
	<dataSource name="jdbc1"
		driver="oracle.jdbc.driver.OracleDriver"
		url="jdbc:oracle:thin:@//<host>:1521/ENDDEV"
		user="un" password="pw"/>
	<document>
		<entity name="END_FRG_PRODUCTS_VW"
			processor="SqlEntityProcessor"
			dataSource="jdbc1"
			query="select PDT_ID, SEARCH_TITLE from END_FRG_PRODUCTS_VW">
			<entity name="END_FRG_FEATURES_VW"
				processor="SqlEntityProcessor"
				cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
				persistCacheName="FEATURE"
				persistCacheBaseDir="d:\cacheloc"
				berkleyInternalCacheSize="1000000"
				persistCacheFieldNames="PDT_ID,PDT_FEATURES"
				persistCacheFieldTypes="STRING,STRING"
				berkleyInternalShared="true"
				cacheKey="PDT_ID"
				cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"
				dataSource="jdbc1"
				query="select PDT_ID, PDT_FEATURES from END_FRG_FEATURES_VW"/>
			<entity name="END_FRG_TAGS_VW"
				processor="SqlEntityProcessor"
				cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
				persistCacheName="TAGS"
				persistCacheBaseDir="d:\cacheloc"
				berkleyInternalCacheSize="1000000"
				persistCacheFieldNames="PDT_ID,TAGS"
				persistCacheFieldTypes="STRING,STRING"
				berkleyInternalShared="true"
				cacheKey="PDT_ID"
				cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"
				dataSource="jdbc1"
				query="select PDT_ID, TAGS from END_FRG_TAGS_VW"/>
			<entity name="END_FRG_CATEGORIES_VW"
				processor="SqlEntityProcessor"
				cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
				persistCacheName="CATEGORIES"
				persistCacheBaseDir="d:\cacheloc"
				berkleyInternalCacheSize="1000000"
				persistCacheFieldNames="PDT_ID,CGY_CODE,WHEN_NO_CGY"
				persistCacheFieldTypes="STRING,STRING,STRING"
				berkleyInternalShared="true"
				cacheKey="PDT_ID"
				cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"
				dataSource="jdbc1"
				query="select PDT_ID, CGY_CODE, WHEN_NO_CGY from END_FRG_CATEGORIES_VW"/>
		</entity>
	</document>
</dataConfig>

The views are specific to the indexing process, so each of the 11 views
contains all the information we want, hence the 'SELECT *'. I see in the
logging that Solr infers the correct types.

Thanks,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4019822.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Maarten,

Glad to hear that your DIH experiment worked well for you.  

To implement something like Endeca's guided navigation, see http://wiki.apache.org/solr/SolrFacetingOverview .  If you need to implement multi-level faceting, see http://wiki.apache.org/solr/HierarchicalFaceting  (but caution:  some of the techniques here are for non-committed feature patches).

If you're trying to do anything but the simplest cases, I would recommend getting yourself a good book that walks you through it, such as Smiley & Pugh's Solr book.  There is also a lot to read on the topic in mailing list archives, blog posts, etc.

The hardest thing for us in going from Endeca was that Solr doesn't have the "N-Value" concept.  So if you want to drill down on, say, "Department", you might do something like this:  facet=true&facet.field=DEPARTMENT , whereas Endeca would generate some esoteric N-value in an obscure XML file, so you would end up with a query like N=4567897865 .  Unfortunately for us, we had those N-values hardcoded all over our application, and we ended up having to create a cross-reference table so that we didn't have to rewrite a ton of code at once.

Overall, Solr's faceting is a lot more flexible than what Endeca has to offer, and it's a lot simpler to set up and understand.  However, Endeca's strong point here is that an admin could configure a lot of behaviors on the back end and then developers could just write to the API and it would do everything for them.  (Of course, this also encourages you to write your app so that it's harder to convert to anything else.)  We were able to convert from Endeca with Solr 1.4, including 2- and 3-level Dimensions, using facet.prefix.  The new features in Solr 3 and especially Solr 4 should make it easier and more efficient, though.
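To illustrate the facet.prefix technique for multi-level dimensions (a sketch of the classic approach from the HierarchicalFaceting wiki page; the field name "category_path" and the depth-prefix encoding are made up here, not taken from anyone's actual schema): each document stores its category path in a string field with a depth prefix, and drilling down means faceting with the parent path as the prefix.

```
Indexed values in a string field, e.g. "category_path":
  0/Books
  1/Books/Fiction
  1/Books/Travel

Top-level facet counts:
  q=*:*&facet=true&facet.field=category_path&facet.prefix=0/

After the user picks "Books", facet one level deeper:
  q=*:*&fq=category_path:"0/Books"&facet=true&facet.field=category_path&facet.prefix=1/Books/
```

The depth prefix keeps each facet request scoped to a single level, so one string field can serve an arbitrary number of dimension levels.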

If you have more questions about faceting, I would start a new discussion thread about it.  There are a lot of approaches to solving various problems so you may get a variety of answers.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: mroosendaal [mailto:mroosendaal@yahoo.com] 
Sent: Wednesday, December 05, 2012 8:23 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH nested entities don't work

Hi James,

Just to let you know, I've just completed the PoC and it worked great!
Thanks.

What I still find difficult is how to implement 'guided' navigation
with Solr. That is one of the strengths of Endeca, and with Solr you have to
create this yourself. What are your thoughts on that, and what challenges did
you encounter when moving away from Endeca?

Thanks,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4024467.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: DIH nested entities don't work

Posted by mroosendaal <mr...@yahoo.com>.
Hi,

I got things working for 7 views and it looks good! The next step is getting
all the views into the index and getting my schema.xml in order.

I'll let you know how it goes!

Thanks!!!,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4022721.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
The line numbers aren't matching up, but best I can tell, it looks like it's probably getting an NPE on this line:

DIHCacheTypes type = types[pkColumnIndex];

If this is the correct line, it seems the "types" array is NULL.  The "types" array is populated from a properties file that gets written when the cache is first created.  You should be able to find this file for the PRODUCTS cache at d:\cacheloc\PRODUCTS_cache.properties ... This file should have two lines, "CACHE_NAMES=" and "CACHE_TYPES=".  The values should be comma-separated lists of column names and data types from your database.  My guess is that this is not getting populated properly, or that the properties file is entirely inaccessible.
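As an illustration only (the column names are guessed from the PRODUCTS query earlier in this thread, not taken from the actual file), a healthy PRODUCTS_cache.properties would look something like:

```
CACHE_NAMES=PDT_ID,SEARCH_TITLE
CACHE_TYPES=STRING,STRING
```

If either line is missing or empty, the "types" array would come back NULL and produce exactly this kind of NPE.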

If this is indeed the problem, the first thing I would try is to change your "SELECT *" statements to statements with explicit column lists, and to also specify "persistCacheFieldNames" and "persistCacheFieldTypes" on the cache-building DIH handlers.

Also, I forget which version of Solr you said you're using, but if you're using 3.6, 4.0-ALPHA, or 4.0-BETA, the "cacheKey" parameter was incorrectly renamed "cachePk".  For 4.0.0, this was renamed back to "cacheKey".

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: mroosendaal [mailto:mroosendaal@yahoo.com] 
Sent: Tuesday, November 27, 2012 3:17 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH nested entities don't work

Hi James,

I was out for a week, hence the late response. I did as you suggested and I
can import several views in parallel. However, running the data-import which
does the join gives the following errors:

27-nov-2012 10:11:24 org.apache.solr.common.SolrException log
SEVERE: Exception while processing: END_FRG_PRODUCTS_VW document :
SolrInputDocument[PDT_GLOBAL_ID=1000004000000000,
PDT_EAN_CODE=0077779703623, PDT_I
D=1000004000000000, PDT_AVAILABILITY=Leverbaar, AVAIL_CODE_ON_STOCK=200,
search_title=19621966 Red Album, AVAIL_CODE_OFF_STOCK=200,
PDT_TYP_CODE=POP]:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63)
        at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:246)
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:472)
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:498)
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:411)
        at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:326)
        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:234)
        at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:382)
        at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:448)
        at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:429)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.BerkleyBackedCache.iterator(BerkleyBackedCache.java:674)
        at
org.apache.solr.handler.dataimport.DIHCacheProcessor.nextRow(DIHCacheProcessor.java:97)
        at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
        ... 8 more
Caused by: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.BerkleyBackedCache$PrimaryKeyTupleBinding.objectToEntry(BerkleyBackedCache.java:1071)
        at
com.sleepycat.bind.tuple.TupleBinding.objectToEntry(TupleBinding.java:73)
        at
org.apache.solr.handler.dataimport.BerkleyBackedCache.iterator(BerkleyBackedCache.java:663)
        ... 10 more

27-nov-2012 10:11:24 org.apache.solr.handler.dataimport.BerkleyBackedCache
close
INFO: Total read/write time for cache: PRODUCTS was 3 ms
27-nov-2012 10:11:24 org.apache.solr.handler.dataimport.BerkleyBackedCache
close
WARNING: couldn't close db for cache: PRODUCTS
Problem: 1 locks left
---- LSN: 0x0/0x170585c----
 ThinLockAddr:391683389 Owner:985340868 -1_Thread-18_ThreadLocker Waiters:
(none)

Local Cache Usage = 108
Cache Layout: Allocation of resources in the cache.
        adminBytes=0
        cacheTotalBytes=31.358
        dataBytes=0
        lockBytes=108
        sharedCacheTotalBytes=78.245

27-nov-2012 10:11:24 org.apache.solr.handler.dataimport.BerkleyBackedCache
close
INFO: Total read/write time for cache: FEATURE was 3 ms
27-nov-2012 10:11:24 org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: [collection1] webapp=/solr path=/dataimport-join-all
params={command=full-import&optimize=false&clean=false&commit=false&verbose=false}
status=0
 QTime=240 {} 0 240
27-nov-2012 10:11:24 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: java
.lang.RuntimeException: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
        at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:382)
        at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:448)
        at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:429)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.lang.NullPointe
rException
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:413)
        at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:326)
        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:234)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63)
        at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:246)
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:472)
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:498)
        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:411)
        ... 5 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.BerkleyBackedCache.iterator(BerkleyBackedCache.java:674)
        at
org.apache.solr.handler.dataimport.DIHCacheProcessor.nextRow(DIHCacheProcessor.java:97)
        at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
        ... 8 more
Caused by: java.lang.NullPointerException
        at
org.apache.solr.handler.dataimport.BerkleyBackedCache$PrimaryKeyTupleBinding.objectToEntry(BerkleyBackedCache.java:1071)
        at
com.sleepycat.bind.tuple.TupleBinding.objectToEntry(TupleBinding.java:73)
        at
org.apache.solr.handler.dataimport.BerkleyBackedCache.iterator(BerkleyBackedCache.java:663)
        ... 10 more

27-nov-2012 10:11:24 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback{flags=0,_version_=0}
27-nov-2012 10:11:24 org.apache.solr.update.DefaultSolrCoreState
newIndexWriter
INFO: Creating new IndexWriter...
27-nov-2012 10:11:24 org.apache.solr.update.DefaultSolrCoreState
newIndexWriter
INFO: Waiting until IndexWriter is unused... core=collection1
27-nov-2012 10:11:24 org.apache.solr.update.DefaultSolrCoreState
newIndexWriter
INFO: Rollback old IndexWriter... core=collection1
27-nov-2012 10:11:24 org.apache.solr.core.CachingDirectoryFactory close
INFO: Closing
directory:D:\apache-solr-4.1-2012-10-30_09-13-06\example\solr\collection1\data\index
27-nov-2012 10:11:24 org.apache.solr.core.CachingDirectoryFactory get
INFO: return new directory for
D:\apache-solr-4.1-2012-10-30_09-13-06\example\solr\collection1\data\index
forceNew:true
27-nov-2012 10:11:24 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
       
commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@D:\apache-solr-4.1-2012-10-30_09-13-06\example\solr\collection1\data\inde
x lockFactory=org.apache.lucene.store.NativeFSLockFactory@74f2db2d;
maxCacheMB=48.0
maxMergeSizeMB=4.0),segFN=segments_te,generation=1058,filenames=[_
1qy_Lucene41_0.doc, _1qy_nrm.cfe, segments_te, _1qy.fnm,
_1qy_Lucene41_0.tip, _1qy_Lucene41_0.pos, _1qy.fdx, _1qy_nrm.cfs,
_1qy_Lucene41_0.tim, _1qy.s
i, _1qy.fdt]

My data-import-join:
<dataConfig>
	<document name="JOIN_ALL">
		<entity name="END_FRG_PRODUCTS_VW"
			processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
			cacheKey="PDT_ID"
			rootEntity="true"
			persistCacheBaseDir="d:\cacheloc"
			persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
			persistCacheName="PRODUCTS"
			berkleyInternalCacheSize="1000000"
			berkleyInternalShared="true">
			<entity name="END_FRG_FEATURES_VW"
				processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
				persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
				persistCacheName="FEATURE"
				persistCacheBaseDir="d:\cacheloc"
				berkleyInternalCacheSize="1000000"
				berkleyInternalShared="true"
				cacheKey="PDT_ID"
				cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"/>
			<entity name="END_FRG_TAGS_VW"
				processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
				persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
				persistCacheName="TAGS"
				persistCacheBaseDir="d:\cacheloc"
				berkleyInternalCacheSize="1000000"
				berkleyInternalShared="true"
				cacheKey="PDT_ID"
				cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"/>
		</entity>
	</document>
</dataConfig>

Just to be clear: products *can* have a feature and/or tag, but it is not
required.

Hope you have an idea.

Thanks,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4022566.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: DIH nested entities don't work

Posted by mroosendaal <mr...@yahoo.com>.
Additional information: I just finished a test for 10,000 records (the db
contains 600K products); it took 25 minutes and all the parent records had
the same 'feature'.



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4019227.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Maarten,

Here is a sample set-up that lets you build your caches in parallel and then index off the caches in a subsequent step.  See below for the solrconfig.xml snippet and the text of the 4 data-config.xml files.  In this example it builds a cache for the parent also, which is not necessary, but I find it cleaner to cache everything so that the final step works against caches only.

Here's how it works.  First, begin a full import for each of the cache builders by issuing these commands all at once.  Each of these builds a cache:
/solrcore/dih-parent?command=full-import
/solrcore/dih-child1?command=full-import
/solrcore/dih-child2?command=full-import

You then need to poll each of these handlers' status screens and wait until they all finish.  Once done, issue this command, which reads back the caches and indexes the data to your Solr core:
/solrcore/dih-master?command=full-import

The tricky thing here is automating it all.  You'll need something that issues the commands and then polls the responses, etc.  For my case, I ended up writing a very hacky program that runs 12 cache-building handlers at once, starting a new one whenever one finishes, until all 50 or so are complete.  It then runs the master DIH handlers.  (An additional complexity for our situation, not shown here, is that I'm using the DIH cache partitioning feature to make multiple partitions; I then have multiple master handlers that each index a slice of the data at the same time, making the "master" step finish faster on a multi-processor machine.)
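The kick-off/poll/master sequence can be sketched like this (a minimal illustration, not James's actual program: the host, core name, and handler paths are hypothetical, and it polls a DIH status response for the <str name="status"> value, which reads "busy" while an import runs and "idle" when it finishes):

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

def dih_status(xml_text):
    """Extract the value of <str name="status"> from a DIH status response."""
    root = ET.fromstring(xml_text)
    for el in root.iter("str"):
        if el.get("name") == "status":
            return el.text
    return None

def run_import(base_url, cache_handlers, master_handler):
    # Kick off every cache-building handler at once.
    for h in cache_handlers:
        urllib.request.urlopen(base_url + h + "?command=full-import")
    # Poll until all builders report idle.
    pending = set(cache_handlers)
    while pending:
        for h in list(pending):
            body = urllib.request.urlopen(base_url + h + "?command=status").read()
            if dih_status(body) == "idle":
                pending.discard(h)
        time.sleep(5)
    # Finally, index from the finished caches.
    urllib.request.urlopen(base_url + master_handler + "?command=full-import")

# Intended usage (would contact a live Solr instance):
# run_import("http://localhost:8983/solr/collection1",
#            ["/dih-parent", "/dih-child1", "/dih-child2"],
#            "/dih-master")
```

A real version would also want error handling and a cap on concurrent builders, as James describes with his 12-at-a-time scheme.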

Another thing that is very confusing with all this is that to build the caches, you send all the cache parameters as request parameters (included in solrconfig.xml here), but for the master indexing they are parameters on the entity in data-config.xml.  Should this feature ever get committed, it would perhaps be better to change this so that all the configuration can occur in data-config.xml, for both building and reading caches.

One last thing: you might want to open a JIRA issue about JDBCDataSource not honoring the JDBC Driver parameter that you're trying to pass through: https://issues.apache.org/jira/browse/SOLR .  If you don't have an account, you will need to create one to open a new issue.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<!-- 4 handlers declared in solrconfig.xml -->
<requestHandler name="/dih-parent" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-parent.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">PARENT</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <!-- ID is Oracle's "number" type which the JDBC driver brings in as a BigDecimal.  
         The field always contains an Integer so we can optimize for that case 
         See org.apache.solr.handler.dataimport.DIHCacheTypes
    -->
    <str name="persistCacheFieldNames">ID,                 SOME_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <!-- all bdb-je caches being built at the same time share this 100mb cache -->
    <str name="berkleyInternalCacheSize">100000000</str> 
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-child1" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child1.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD1</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID,          CHILD_ONE_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-child2" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child2.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD2</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID,          CHILD_TWO_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-master" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
  <str name="config">dataconfig-master.xml</str>
  <str name="clean">true</str>
  <str name="commit">true</str>
  <str name="optimize">false</str>
</lst>
</requestHandler>


<!-- dataconfig-parent.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="PARENT">
    <entity name="PARENT" dataSource="zzz" query="SELECT ID, SOME_DATA FROM PARENT" />
  </document>
</dataConfig>

<!-- dataconfig-child1.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD1">
    <entity name="CHILD1" dataSource="zzz" query="SELECT PARENT_ID, CHILD_ONE_DATA FROM CHILD1" />
  </document>
</dataConfig>

<!-- dataconfig-child2.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD2">
    <entity name="CHILD2" dataSource="zzz" query="SELECT PARENT_ID, CHILD_TWO_DATA FROM CHILD2" />
  </document>
</dataConfig>

<!-- dataconfig-master.xml -->
<dataConfig>
  <document name="MASTER">
    <entity name="PARENT"
      processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
      cacheKey="ID"      
      persistCacheBaseDir="/path/to/caches"
      persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
      persistCacheName="PARENT"
      berkleyInternalCacheSize="100000000"
      berkleyInternalShared="true"
    >
      <!-- all bdb-je caches share this 100mb cache -->
      <entity
        name="CHILD1"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        cacheKey="PARENT_ID"
        cacheLookup="PARENT.ID"        
        persistCacheBaseDir="/path/to/caches"
        persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
        persistCacheName="CHILD1"
        berkleyInternalCacheSize="100000000"
        berkleyInternalShared="true"        
      />
      <entity
        name="CHILD2"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        cacheKey="PARENT_ID"
        cacheLookup="PARENT.ID"        
        persistCacheBaseDir="/path/to/caches"
        persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
        persistCacheName="CHILD2"        
        berkleyInternalCacheSize="100000000"
        berkleyInternalShared="true"
      />
    </entity>
  </document>
</dataConfig>





RE: DIH nested entities don't work

Posted by mroosendaal <mr...@yahoo.com>.
Hi,

You are correct about not wanting to index everything every day; however, for
this PoC I need a 'bootstrap' mechanism that basically does what Endeca
does.

The 'defaultRowPrefetch' in solrconfig.xml does not seem to take effect; I'll
have a closer look.

As for the long run time, it appeared that one of the views I was reading was
also by far the biggest, with over 4 million entries. Other views should take
much less time.

With regard to the parallel processing, I have the 2 classes you mention
and have packaged them. The documentation in the patch was not clear on how
exactly to do that. My assumptions are:
* for every entity you have to define a DIH handler in solrconfig.xml and refer
to a specific data-config-<entity>.xml
* define 1 import handler for the join in solrconfig.xml
* what isn't clear is how a data-config-<entity>.xml should look (for
example, I see no reference in the documentation to a cacheName)
* nor how the data-config-join.xml should look

My first attempt:
the data-config-products.xml (parent):
<dataSource name="jdbc1"
            driver="oracle.jdbc.driver.OracleDriver"
            url="jdbc:oracle:thin:@//<host>:1521/ENDDEV" user="un" password="pw"/>
<document>
  <entity name="END_FRG_PRODUCTS_VW"
          processor="SqlEntityProcessor"
          cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
          writerImpl="org.apache.solr.handler.dataimport.DIHCacheWriter"
          dataSource="jdbc1"
          rootEntity="true"
          persistCacheName="PRODUCTS"
          persistCacheBaseDir="d:\cacheloc"
          berkleyInternalCacheSize="1000000"
          persistCacheFieldNames="PDT_ID,SEARCH_TITLE,PDT_GLOBAL_ID,PDT_EAN_CODE,PDT_TYP_CODE,PDT_AVAILABILITY,AVAIL_CODE_OFF_STOCK,AVAIL_CODE_ON_STOCK,OFFER_TYPE"
          persistCacheFieldTypes="STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING"
          query="select PDT_ID,SEARCH_TITLE,PDT_GLOBAL_ID,PDT_EAN_CODE,PDT_TYP_CODE,PDT_AVAILABILITY,AVAIL_CODE_OFF_STOCK,AVAIL_CODE_ON_STOCK,OFFER_TYPE from END_FRG_PRODUCTS_VW"/>
</document>

the data-config-features.xml (child):
<dataSource name="jdbc1"
            driver="oracle.jdbc.driver.OracleDriver"
            url="jdbc:oracle:thin:@//<host>:1521/ENDDEV" user="un" password="pw"
            batchSize="20000"/>
<document>
  <entity name="END_FRG_FEATURES_VW"
          processor="SqlEntityProcessor"
          cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
          writerImpl="org.apache.solr.handler.dataimport.DIHCacheWriter"
          persistCacheName="FEATURE"
          persistCacheBaseDir="d:\cacheloc"
          berkleyInternalCacheSize="1000000"
          persistCacheFieldNames="PDT_ID,PDT_FEATURES"
          persistCacheFieldTypes="STRING,STRING"
          berkleyInternalShared="true"
          cacheKey="PDT_ID"
          cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"
          dataSource="jdbc1"
          query="select PDT_ID, PDT_FEATURES from END_FRG_FEATURES_VW"/>
</document>

the data-config-join.xml:
<entity name="END_FRG_PRODUCTS_VW"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        rootEntity="true"
        name="PARENT"
        persistCacheFieldNames="PDT_ID,SEARCH_TITLE,PDT_GLOBAL_ID,PDT_EAN_CODE,PDT_TYP_CODE,PDT_AVAILABILITY,AVAIL_CODE_OFF_STOCK,AVAIL_CODE_ON_STOCK,OFFER_TYPE"
        persistCacheFieldTypes="STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING">
  <entity name="END_FRG_FEATURES_VW"
          processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
          cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
          persistCacheName="FEATURE"
          persistCacheBaseDir="d:\cacheloc"
          berkleyInternalCacheSize="1000000"
          persistCacheFieldNames="PDT_ID,PDT_FEATURES"
          persistCacheFieldTypes="STRING,STRING"
          berkleyInternalShared="true"
          cacheKey="PDT_ID"
          cacheLookup="END_FRG_PRODUCTS_VW.PDT_ID"/>
</entity>

Is this a correct setup? Hope you can give some pointers.

Thanks,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4020727.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Depending on how much data you're pulling back, 2 hours might be a reasonable amount of time.  Of course, if you had it running a lot faster with Endeca & Forge, I can understand your questioning this.  Keep in mind that the way you're set up, it will build each cache one at a time.  I'm pretty sure Forge does them serially like this too, unless you use complicated tricks to work around it.  For DIH, likewise, there is a way to build your caches in parallel: set up multiple DIH handlers that first build your caches, then a final handler that indexes the pre-cached data.  You need DIHCacheWriter and DIHCacheProcessor from SOLR-2943.

The default for berkleyInternalCacheSize is 2% of your JVM heap.  You might get better performance increasing this, but then again you might find that 2% of heap is way plenty big enough and you should just make it smaller to conserve memory.  This parameter takes bytes, so use 100000 for 100k, etc.
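To make the 2%-of-heap default concrete (illustrative numbers, not from the thread):

```python
# berkleyInternalCacheSize takes a byte count; by default bdb-je sizes
# its internal cache at roughly 2% of the JVM heap.
heap_bytes = 4 * 1024 ** 3                    # suppose a 4 GB heap
default_cache_bytes = int(heap_bytes * 0.02)  # ~86 million bytes by default
explicit_100k = 100000                        # "use 100000 for 100k"
```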

I think the file size is hardcoded to 1 GB, so if you're getting 9 files, it means your query is pulling back more than 8 GB of data. Sound right?

To get the "defaultRowPrefetch", try putting this in the <defaults /> section under <requestHandler name="/dataimport" ... /> in solrconfig.xml.  Based on a quick review of the code, it seems that it will only honor jdbc parameters if they are in "defaults".
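In solrconfig.xml that would look something like the following (a sketch; whether JDBCDataSource actually passes "defaultRowPrefetch" through to the Oracle driver is exactly what's in question here):

```xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- extra entries in "defaults" get passed along as JDBC properties -->
    <str name="defaultRowPrefetch">20000</str>
  </lst>
</requestHandler>
```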

Also keep in mind that Lucene/Solr handle updates really well and with the size of your data, you likely will want to use delta updates rather than re-index all the time.  If so, then perhaps the total time to pull back everything isn't going to matter quite as much?  To implement delta updates with DIH in your case, I'd recommend the approach outlined here: http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ... (you can still use bdb-je for caches if it still makes sense depending on how big the deltas are)
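The pattern on that wiki page boils down to a parameterized full-import query, roughly like this (a sketch; the table and column names are illustrative):

```xml
<!-- run with command=full-import for a full rebuild, or
     command=full-import&clean=false for a delta pass -->
<entity name="PARENT"
        query="SELECT ID, SOME_DATA FROM PARENT
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR LAST_MODIFIED &gt; '${dataimporter.last_index_time}'"/>
```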

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311




RE: DIH nested entities don't work

Posted by mroosendaal <mr...@yahoo.com>.
Hi James,

Just gave it a go and it worked! That's the good news. The problem now is
getting it to work faster: it took over 2 hours just to index 4 views, and I
need to get information from 26.

I tried adding defaultRowPrefetch="20000" as a JDBC parameter, but it
does not seem to honour it. It should work, because it is part of the
Oracle JDBC driver, but there's no mention of it in the Solr documentation.

Would it also help to increase the berkleyInternalCacheSize? For
'CATEGORIES' it creates 9 'files'.

Thanks,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4020503.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Here's what I'd do next:

- Double-check you're only caching the child entity, not the parent.
- Replace the "SELECT *" queries with a list of the actual fields you want.
- Specify the persistCacheFieldNames and persistCacheFieldTypes parameters (see the doc-comment for DIHCachePersistProperties).
- Try running again.
- If it fails, post the exact data-config.xml you tried to this list for a closer look.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311





RE: DIH nested entities don't work

Posted by mroosendaal <mr...@yahoo.com>.
Hi,

I've created a jar with 5 files:
4 files with DIHCache*.java
1 file named BerkleyBackedCache.java

I've changed the data-config based on your input. What I see it doing is
building a cache at the given location. However, the first test run
took almost *3* hours before I got a message: "connection reset by peer:
socket write error".

I'll try again, but the fact that it takes even more than an hour to process
would indicate that I'm missing something.

Any suggestions?

Thanks,
Maarten



--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4019728.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: DIH nested entities don't work

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Here are things I would try:
- You need to package the patch from SOLR-2943 in your jar as well as SOLR-2613 (to get the class DIHCachePersistCacheProperties)

- You need to specify "cacheImpl", not "persistCacheImpl"

- You are correct in using "persistCacheName" & "persistCacheBaseDir", unlike the test case, where these parameters are extraneous and out-of-date.

- I wouldn't cache the parent entity, just the child.

- Don't specify persistCachePartitionNumber unless you're actually trying to partition your caches (I wouldn't try this at first).

What will happen is that it will loop through the parent's resultset document-by-document.  On the first iteration, it will note that the child entity's cache hasn't been initialized and will build a cache for it.  Then, on each iteration, it pulls the child rows out of the cache while looping through the parent's resultset.

Hopefully this will work better for you.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

