Posted to solr-user@lucene.apache.org by mechravi25 <me...@yahoo.co.in> on 2012/08/14 15:03:36 UTC

Dataimport Handler in solr 3.6.1

I am indexing some data using DataImportHandler (DIH) config files in Solr 3.6.1. I am using
a nested entity in my config file.
I noticed that, instead of only the records that should be fetched
for a document,
all the records present in the table are indexed.

The following is the ideal scenario, showing how the data should be indexed.
For a document A, I am trying to index the 2 values B and C as a multivalued
field:

<id>A</id>
<related_id>
<str>B</str>
<str>C</str>
</related_id>

This is how the output should be. I have used the same DIH file with Solr
1.4 and 3.5,
and the data was indexed correctly, as shown above, in both
versions.

But in Solr 3.6.1, the data was indexed differently. In my table, there
are 4 values (B, C, D, E) in the related_id field.
This is how the data is indexed in 3.6.1:

<id>A</id>
<related_id>
<str>B</str>
<str>C</str>
<str>D</str>
<str>E</str>
</related_id>

Ideally, the values D and E should not be indexed under id "A". The same
thing happens for the other id records.


Following is the content of the DIH file



    <entity name="ent1" query="select sid as id from Table1 a"
            transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">

        <field column="id" name="id" boost="0.5"/>

        <entity name="ent2" query="select id1,rid from Table2"
                processor="CachedSqlEntityProcessor" cacheKey="id1" cacheLookup="ent1.uid"
                transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">

            <field column="rid" name="related_id"/>

        </entity>

    </entity>
        
        
        
I tried changing CachedSqlEntityProcessor to SqlEntityProcessor and
then reindexed, but I still faced the same issue.

When I googled a bit, I found this URL:
https://issues.apache.org/jira/browse/SOLR-3360

I am not sure whether SOLR-3360 describes the same scenario that I have
described above.

Please guide me.

Thanks.



--
View this message in context: http://lucene.472066.n3.nabble.com/Dataimport-Handler-in-solr-3-6-1-tp4001149.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Dataimport Handler in solr 3.6.1

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
There were 2 major changes to DIH Cache functionality in Solr 3.6, only 1 of which was carried to Solr 4.0:

- Solr 3.6 had 2 MAJOR changes:

1. We support pluggable caches so that you can write your own cache implementations and cache however you want.  The goal here is to allow you to cache to disk when you have to do large, complex joins and an in-memory cache could result in an OOM.  Also, you can specify "cacheImpl" with any EntityProcessor, not just SqlEntityProcessor.  So you can join child entities that come from XML, flat files, etc.  CachedSqlEntityProcessor is technically deprecated, as using it is the same as SqlEntityProcessor with cacheImpl="SortedMapBackedCache" specified.  This does a simple in-memory cache very similar to Solr 3.5 and prior. (see https://issues.apache.org/jira/browse/SOLR-2382)

2. Extensive work was done to try and make the "threads" parameter work in more situations.  This involved some rather invasive changes to the DIH Cache functionality. (see https://issues.apache.org/jira/browse/SOLR-3011)

- Solr 4.0 has #1 above, BUT NOT #2.  Rather the "threads" functionality was entirely removed.
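
As a rough illustration of point 1, the child entity from the original post could be written with the generic processor plus an explicit cache implementation. This is only a sketch: the entity, table and column names come from the config posted in this thread, the lookup uses "ent1.id" on the assumption that the parent exposes an "id" column, and whether a given release honors the cacheKey/cacheLookup attribute names exactly as written is worth verifying against that release.

```xml
<!-- Sketch only: names are taken from the config posted in this thread. -->
<!-- Per point 1, this is meant to behave like CachedSqlEntityProcessor. -->
<entity name="ent2"
        processor="SqlEntityProcessor"
        cacheImpl="SortedMapBackedCache"
        cacheKey="id1"
        cacheLookup="ent1.id"
        query="select id1, rid from Table2">
  <field column="rid" name="related_id"/>
</entity>
```

A custom cache would be selected the same way, by naming its class in cacheImpl instead of SortedMapBackedCache.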

Consequently, if the problem is due to #2 (SOLR-3011), this isn't as big a problem because 3.x users can simply use the 3.5 DIH jar (but some use-cases involving "threads" work with the 3.6(.1) jar and not at all with 3.5, so users will have to pick and choose the best version to use for their instance).
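
One way to try the 3.5 DIH jar under a 3.6.1 install is to point a lib directive in solrconfig.xml at it. The fragment below is a sketch under assumptions: the directory and the exact jar file name are guesses and need to match wherever the 3.5 jar was actually copied, and the 3.6.1 DIH jar must not also be on the classpath.

```xml
<!-- solrconfig.xml fragment (sketch). The dir value and the jar file name are assumptions. -->
<!-- Make sure the 3.6.1 DIH jar is not loaded from another lib directive or from the lib directory. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-3\.5\.0\.jar" />
```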

My concern is that there are issues with #1 (SOLR-2382).  That's why I'm asking if at all possible you can try this with Solr 4.0.  I have tested Solr 4.0 extensively here and it seems caching works exactly as it ought.  However, DIH is flexible in how it can be configured and there could be something that was broken that I have not uncovered myself.  Any issues that may exist with SOLR-2382 need to be identified and fixed in the 4.x branch as soon as possible.

I apologize for the late response.  I was away the past week.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: mechravi25 [mailto:mechravi25@yahoo.co.in] 
Sent: Tuesday, August 21, 2012 7:47 AM
To: solr-user@lucene.apache.org
Subject: RE: Dataimport Handler in solr 3.6.1

Hi James,

Thanks for the suggestions. 

Actually it is cacheLookup="ent1.id"; I had misspelt it. Also, I will
need the transformers mentioned, as there are other columns as well.

I actually tried using the 3.5 DIH jars in 3.6.1 and indexed the same data, and the
indexing was successful. But I wanted this to work with the 3.6.1 DIH. I just came
across the SOLR-2382 patch. I tried giving the following

processor="CachedSqlEntityProcessor" cacheImpl="SortedMapBackedCache" 

in my DIH.xml file. For static fields in child entities, the indexing
happened fine, but for dynamic fields, only one of the dynamic fields
was indexed and the rest were skipped, even though the total number of rows
fetched from the datasource was correct.

Following are my questions:

1.) Is there a big difference between the Solr 3.5 and 3.6.1 DIH handler files? For example,
is there any new feature in the 3.6 DIH that is not present in 3.5?
2.) Am I missing something when giving cacheImpl="SortedMapBackedCache"
in my DIH.xml, because of which dynamic fields are not indexed properly?
There is no change to my DIH file from my previous post apart from this
cacheImpl addition, and the dynamic fields are indexed properly if I do
not give this cacheImpl. Am I missing something here?

Thanks.



--
View this message in context: http://lucene.472066.n3.nabble.com/Dataimport-Handler-in-solr-3-6-1-tp4001149p4002421.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Dataimport Handler in solr 3.6.1

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
One thing I notice in your configuration...the child entity has this:

cacheLookup="ent1.uid"

but your parent entity doesn't have a "uid" field.  

Also, you have these 3 transformers:  RegexTransformer,DateFormatTransformer,TemplateTransformer

but none of your columns seem to make use of these.  Are you sure you need them?
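
Taken together, those two observations would suggest a config along these lines. It is a sketch, not a verified fix: it assumes the lookup is meant to reference the parent's "id" column, that the missing "from" in the posted parent query is a typo, and that no other columns need the transformers.

```xml
<!-- Sketch only: table and column names are taken from the config posted in this thread. -->
<entity name="ent1" query="select sid as id from Table1 a">
  <field column="id" name="id" boost="0.5"/>

  <!-- cacheLookup now points at a column the parent entity actually returns. -->
  <entity name="ent2"
          query="select id1, rid from Table2"
          processor="CachedSqlEntityProcessor"
          cacheKey="id1"
          cacheLookup="ent1.id">
    <field column="rid" name="related_id"/>
  </entity>
</entity>
```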

In any case, I suspect there may still be bugs in 3.6.1 related to CachedSqlEntityProcessor, so if you are able to create a failing unit test and post it to JIRA, that would be helpful.  If you need to, you can use the 3.5 DIH jar with Solr 3.6.1.  Also, I do not think SOLR-3360 should affect you unless you're using the "threads" parameter.  Both SOLR-3360 & SOLR-3430 fixed bugs related to CachedSqlEntityProcessor that were introduced in 3.6.0 (from SOLR-3011 and SOLR-2382 respectively).

Finally, if you are at all able to test this on 4.0-beta, I would greatly appreciate it!  SOLR-3011/SOLR-3360 were never applied to version 4.0 because "threads" support was removed entirely.  However, SOLR-2382/SOLR-3430 were applied to 4.0 also.  If we have any more SOLR-2382 bugs lingering in 4.0, these really need to be fixed, so any testing help would be much appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

