Posted to solr-user@lucene.apache.org by Chantal Ackermann <ch...@btelligent.de> on 2009/07/22 18:35:02 UTC

DataImportHandler / Import from DB : one data set comes in multiple rows

Hi all,

this is my first post, as I am new to SOLR (some Lucene exp).

I am trying to load data from an existing datamart into SOLR using the 
DataImportHandler but in my opinion it is too slow due to the special 
structure of the datamart I have to use.

Root Cause:
This datamart uses a row-based approach (pivot) to present its data. It 
was designed that way to allow adding more attributes to a data set 
without having to change the table structure.

Impact:
To use the DataImportHandler, I have to pivot the data back into one 
row per data set. Unfortunately, this results in more queries that are 
also less performant. Moreover, there are sometimes multiple rows for a 
single attribute, which require separate queries - or trickier 
subselects that probably don't speed things up.
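
To make the pivoting concrete, here is a rough sketch of what I mean (the
column names "name" and "value" are just placeholders for this example):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: collapses the attribute rows belonging to one data set
// into a single record, collecting repeated attribute names into lists.
public class PivotSketch {

    @SuppressWarnings("unchecked")
    static Map<String, Object> pivot(List<Map<String, Object>> rows) {
        Map<String, Object> record = new HashMap<String, Object>();
        for (Map<String, Object> row : rows) {
            String att = (String) row.get("name");   // attribute name column
            Object val = row.get("value");           // attribute value column
            Object existing = record.get(att);
            if (existing == null) {
                record.put(att, val);                // first occurrence
            } else if (existing instanceof List) {
                ((List<Object>) existing).add(val);  // third or later occurrence
            } else {
                List<Object> values = new ArrayList<Object>();
                values.add(existing);                // second occurrence: switch to a list
                values.add(val);
                record.put(att, values);
            }
        }
        return record;
    }
}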

Here is an example of the relation between DB requests, row fetches and 
actual number of documents created:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">3737</str>
<str name="Total Rows Fetched">5380</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-22 18:19:06</str>
<str name="">
Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
</str>
<str name="Committed">2009-07-22 18:22:29</str>
<str name="Optimized">2009-07-22 18:22:29</str>
<str name="Time taken ">0:3:22.484</str>
</lst>

(Full index creation.)
There are about half a million data sets in total. That would mean 
about 30 hours for a full index, wouldn't it? My feeling is that there 
are far too many row fetches per data set.

I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using 
around 680MB RAM, Java6. I haven't changed the Lucene configuration 
(merge factor 10, ram buffer size 32).

Possible solutions?
A) Write my own DataImportHandler?
B) Write my own "MultiRowTransformer" that accepts several rows as input 
argument (not sure this is a valid option)?
C) Approach the DB developers to add a flat table with one data set per row?
D) ...?

If someone would like to share their experiences, that would be great!

Thanks a lot!
Chantal



-- 
Chantal Ackermann

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Glen Newton <gl...@gmail.com>.
Hi Otis,

Yes, you are right: LuSql is heavily optimized for multi-thread/multi-core use.
It also performs better with multiple threads on a single core, due to
the heavily I/O-bound nature of Lucene indexing.

So if the DB is the bottleneck, well, yes, then LuSql and any other
tool are not going to help. Resolve the DB bottleneck, and then decide
what tool best serves your indexing requirements.

Only slightly off topic: I have noticed one problem with DBs (with
LuSql and custom JDBC clients processing records) when the fetch size
is too large and the amount of processing per record gets too large:
sometimes the connection times out because the gap between fetching one
batch and the next grows too long (due to the accumulated delay from
processing all the records in the current batch). This is solved by
reducing the fetch size. I am not sure whether Solr/DIH users have
experienced this. LuSql allows setting the fetch size (like DIH, I
believe), and an unreleased version re-issues the SQL and offsets to
the last+1 record when this happens.
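
For anyone who hasn't used it, the knob in question is just the plain
JDBC fetch size; a minimal sketch (driver URL, credentials and query are
placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/datamart", "user", "password");
        Statement stmt = con.createStatement();
        // Rows pulled from the server per round trip. Too large a value means
        // long gaps between fetches while each batch is processed, which can
        // trip connection/read timeouts; smaller values avoid that at the
        // cost of more round trips.
        stmt.setFetchSize(500);
        ResultSet rs = stmt.executeQuery("select id, name, value from datamart");
        while (rs.next()) {
            process(rs.getLong("id"), rs.getString("name"), rs.getString("value"));
        }
        rs.close();
        stmt.close();
        con.close();
    }

    private static void process(long id, String name, String value) {
        // stand-in for the per-record work (e.g. building an index document)
    }
}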

-glen

2009/7/23 Otis Gospodnetic <ot...@yahoo.com>:
> Note that the statement about LuSql (or really any other tool, LuSql is just an example because it was mentioned) is true only if Solr is underutilized because DIH uses a single thread to talk to Solr (is this correct?) vs. LuSql using multiple (I'm guessing that's the case becase of the multicore comment).
>
> But, if the DB itself if your bottleneck, and I've seen plenty of such cases, then speed of DIH vs. LuSql vs. something else matters less.  Glen, please correct me if I'm wrong about this - I know you have done plenty of benchmarking. :)
>
>  Otis
> --
> Sematext is hiring: http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Glen Newton <gl...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, July 23, 2009 5:52:43 AM
>> Subject: Re: DataImportHandler / Import from DB : one data set comes in  multiple rows
>>
>> Chantal,
>>
>> You might consider LuSql[1].
>> It has much better performance than Solr DIH. It runs 4-10 times faster on a
>> multicore machine, and can run in 1/20th the heap size Solr needs. It
>> produces a Lucene index.
>>
>> See slides 22-25 in this presentation comparing Solr DIH with LuSql:
>> http://code4lib.org/files/glen_newton_LuSql.pdf
>>
>> [1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>>
>> Disclosure: I am the author of LuSql.
>>
>> Glen Newton
>> http://zzzoot.blogspot.com/
>>
>> 2009/7/22 Chantal Ackermann :
>> > Hi all,
>> >
>> > this is my first post, as I am new to SOLR (some Lucene exp).
>> >
>> > I am trying to load data from an existing datamart into SOLR using the
>> > DataImportHandler but in my opinion it is too slow due to the special
>> > structure of the datamart I have to use.
>> >
>> > Root Cause:
>> > This datamart uses a row based approach (pivot) to present its data. It was
>> > so done to allow adding more attributes to the data set without having to
>> > change the table structure.
>> >
>> > Impact:
>> > To use the DataImportHandler, i have to pivot the data to create again one
>> > row per data set. Unfortunately, this results in more and less performant
>> > queries. Moreover, there are sometimes multiple rows for a single attribute,
>> > that require separate queries - or more tricky subselects that probably
>> > don't speed things up.
>> >
>> > Here is an example of the relation between DB requests, row fetches and
>> > actual number of documents created:
>> >
>> >
>> > 3737
>> > 5380
>> > 0
>> > 2009-07-22 18:19:06
>> > −
>> >
>> > Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>> >
>> > 2009-07-22 18:22:29
>> > 2009-07-22 18:22:29
>> > 0:3:22.484
>> >
>> >
>> > (Full index creation.)
>> > There are about half a million data sets, in total. That would require about
>> > 30h for indexing? My feeling is that there are far too many row fetches per
>> > data set.
>> >
>> > I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>> > around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
>> > factor 10, ram buffer size 32).
>> >
>> > Possible solutions?
>> > A) Write my own DataImportHandler?
>> > B) Write my own "MultiRowTransformer" that accepts several rows as input
>> > argument (not sure this is a valid option)?
>> > C) Approach the DB developers to add a flat table with one data set per row?
>> > D) ...?
>> >
>> > If someone would like to share their experiences, that would be great!
>> >
>> > Thanks a lot!
>> > Chantal
>> >
>> >
>> >
>> > --
>> > Chantal Ackermann
>> >
>>
>>
>>
>> --
>>
>> -
>
>



-- 

-

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Note that the statement about LuSql (or really any other tool; LuSql is just an example because it was mentioned) is true only if Solr is underutilized because DIH uses a single thread to talk to Solr (is this correct?) vs. LuSql using multiple threads (I'm guessing that's the case because of the multicore comment).

But, if the DB itself is your bottleneck, and I've seen plenty of such cases, then the speed of DIH vs. LuSql vs. something else matters less.  Glen, please correct me if I'm wrong about this - I know you have done plenty of benchmarking. :)
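
To illustrate what "using multiple threads" means here (only a rough
sketch of the pattern, not LuSql's or DIH's actual code; the Indexer
interface is made up for the example, and whatever stands behind it has
to be thread-safe, as Lucene's IndexWriter is):

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexingSketch {

    /** Placeholder for whatever actually writes a document to the index. */
    interface Indexer {
        void add(Map<String, Object> doc);
    }

    static void indexAll(Iterable<Map<String, Object>> rows,
                         final Indexer indexer, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final Map<String, Object> row : rows) {
            pool.submit(new Runnable() {
                public void run() {
                    indexer.add(row);   // documents are analyzed/written in parallel
                }
            });
        }
        pool.shutdown();                          // no new tasks
        pool.awaitTermination(1, TimeUnit.DAYS);  // wait for the queue to drain
    }
}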

 Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Glen Newton <gl...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, July 23, 2009 5:52:43 AM
> Subject: Re: DataImportHandler / Import from DB : one data set comes in  multiple rows
> 
> Chantal,
> 
> You might consider LuSql[1].
> It has much better performance than Solr DIH. It runs 4-10 times faster on a
> multicore machine, and can run in 1/20th the heap size Solr needs. It
> produces a Lucene index.
> 
> See slides 22-25 in this presentation comparing Solr DIH with LuSql:
> http://code4lib.org/files/glen_newton_LuSql.pdf
> 
> [1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
> 
> Disclosure: I am the author of LuSql.
> 
> Glen Newton
> http://zzzoot.blogspot.com/
> 
> 2009/7/22 Chantal Ackermann :
> > Hi all,
> >
> > this is my first post, as I am new to SOLR (some Lucene exp).
> >
> > I am trying to load data from an existing datamart into SOLR using the
> > DataImportHandler but in my opinion it is too slow due to the special
> > structure of the datamart I have to use.
> >
> > Root Cause:
> > This datamart uses a row based approach (pivot) to present its data. It was
> > so done to allow adding more attributes to the data set without having to
> > change the table structure.
> >
> > Impact:
> > To use the DataImportHandler, i have to pivot the data to create again one
> > row per data set. Unfortunately, this results in more and less performant
> > queries. Moreover, there are sometimes multiple rows for a single attribute,
> > that require separate queries - or more tricky subselects that probably
> > don't speed things up.
> >
> > Here is an example of the relation between DB requests, row fetches and
> > actual number of documents created:
> >
> > 
> > 3737
> > 5380
> > 0
> > 2009-07-22 18:19:06
> > −
> > 
> > Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
> > 
> > 2009-07-22 18:22:29
> > 2009-07-22 18:22:29
> > 0:3:22.484
> > 
> >
> > (Full index creation.)
> > There are about half a million data sets, in total. That would require about
> > 30h for indexing? My feeling is that there are far too many row fetches per
> > data set.
> >
> > I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
> > around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
> > factor 10, ram buffer size 32).
> >
> > Possible solutions?
> > A) Write my own DataImportHandler?
> > B) Write my own "MultiRowTransformer" that accepts several rows as input
> > argument (not sure this is a valid option)?
> > C) Approach the DB developers to add a flat table with one data set per row?
> > D) ...?
> >
> > If someone would like to share their experiences, that would be great!
> >
> > Thanks a lot!
> > Chantal
> >
> >
> >
> > --
> > Chantal Ackermann
> >
> 
> 
> 
> -- 
> 
> -


Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Glen Newton <gl...@gmail.com>.
Chantal,

You might consider LuSql[1].
It has much better performance than Solr DIH. It runs 4-10 times faster on a
multicore machine, and can run in 1/20th the heap size Solr needs. It
produces a Lucene index.

See slides 22-25 in this presentation comparing Solr DIH with LuSql:
 http://code4lib.org/files/glen_newton_LuSql.pdf

[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclosure: I am the author of LuSql.

Glen Newton
http://zzzoot.blogspot.com/

2009/7/22 Chantal Ackermann <ch...@btelligent.de>:
> Hi all,
>
> this is my first post, as I am new to SOLR (some Lucene exp).
>
> I am trying to load data from an existing datamart into SOLR using the
> DataImportHandler but in my opinion it is too slow due to the special
> structure of the datamart I have to use.
>
> Root Cause:
> This datamart uses a row based approach (pivot) to present its data. It was
> so done to allow adding more attributes to the data set without having to
> change the table structure.
>
> Impact:
> To use the DataImportHandler, i have to pivot the data to create again one
> row per data set. Unfortunately, this results in more and less performant
> queries. Moreover, there are sometimes multiple rows for a single attribute,
> that require separate queries - or more tricky subselects that probably
> don't speed things up.
>
> Here is an example of the relation between DB requests, row fetches and
> actual number of documents created:
>
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">3737</str>
> <str name="Total Rows Fetched">5380</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2009-07-22 18:19:06</str>
> −
> <str name="">
> Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
> </str>
> <str name="Committed">2009-07-22 18:22:29</str>
> <str name="Optimized">2009-07-22 18:22:29</str>
> <str name="Time taken ">0:3:22.484</str>
> </lst>
>
> (Full index creation.)
> There are about half a million data sets, in total. That would require about
> 30h for indexing? My feeling is that there are far too many row fetches per
> data set.
>
> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
> around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
> factor 10, ram buffer size 32).
>
> Possible solutions?
> A) Write my own DataImportHandler?
> B) Write my own "MultiRowTransformer" that accepts several rows as input
> argument (not sure this is a valid option)?
> C) Approach the DB developers to add a flat table with one data set per row?
> D) ...?
>
> If someone would like to share their experiences, that would be great!
>
> Thanks a lot!
> Chantal
>
>
>
> --
> Chantal Ackermann
>



-- 

-

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Fri, Jul 24, 2009 at 2:04 PM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> oh, I didn't know that. That did it! (Jippieh!)
> I've pasted the relevant code below in case you're interested.
>
> Thank you for your help!!
>
> Another little observation that I made using the debugger interface:
> it threw an exception trying to cast Long / String[] etc. to String. I guess
> that happened because I filled that Map that is declared as
> Map<String,Object> with different object types as values.
> Maybe it should be changed to Map<String,String> if you have a cast in
> there? Or maybe it's just a bug in the debugger. (Casting to String instead
> of using toString()?)
> I haven't tried changing it back to non-String objects, yet. I'll play
> around with it - and keep you posted, if you like.
>
> If there is any interest I'll ask my boss whether I can contribute that
> class to your code? I'd rework it to be as generic as I can, of course.

If the use case is generic enough, you can raise an issue and provide a
patch for others to use. Just make sure that the patch is against the
1.4 trunk.

>
> Cheers,
> Chantal
>
> *****************************************************************
> Here is the construct (I've changed it to use the rowIterator instead of
> getNext() ):
>
> // return null once the end of that data set is reached
> if (!rowIterator.hasNext()) {
>        rowIterator = null;
>        return null;
> }
> // as long as there is data, iterate over the rows and pivot them
> // return the pivotted row after the last row of data has been reached
> do {
>        fieldRow = rowIterator.next();
>        attName = fieldRow.get(...);
>        newValue = fieldRow.get(...);
>        pivottedRow.put(attName, newValue);
> } while (rowIterator.hasNext());
>
> pivottedRow = applyTransformer(pivottedRow);
> log.info("Returning: " + pivottedRow);
> return pivottedRow;
> ******************************************************************
>
>
> Noble Paul നോബിള്‍ नोब्ळ् schrieb:
>>
>> The API of EntityProcessor is as follows:
>>
>> nextRow() is invoked again and again until it returns null. If
>> getNext() returns null, you should also return null:
>>
>> are you doing that?
>>
>> fieldRow = getNext();
>> if(fieldRow == null) return null;
>>
>> open the DIH debug page and you should see the debug output
>>
>> On Thu, Jul 23, 2009 at 9:11 PM, Chantal
>> Ackermann<ch...@btelligent.de> wrote:
>>>
>>> Hi Paul,
>>>
>>> thanks for your patience. Before, I omitted parts of the code to make it
>>> easier to read. Below is the complete code for the Processor.
>>>
>>> I tried starting the DataImportHandler with the parameter debug=true
>>> (in the URL call, e.g.
>>> http://localhost:8080/solr/de/dataimport?command=full-import&debug=true),
>>> but this doesn't change the output in any way at all. It seems as if the
>>> parameter were being ignored - am I using it incorrectly?
>>>
>>> Changing the log level to FINE in the Solr admin produces loads of
>>> output from Tomcat, on top. The "interesting" observation in that case
>>> is that there is no additional log output between the nextRow() calls.
>>> The "Returning: " line is output, so the method exits but is called
>>> immediately afterwards, again. (Or so it seems?) There is no log output
>>> from any code that would try to create a Lucene document.
>>>
>>> Regards,
>>> Chantal
>>>
>>> ****************** C O D E *******************************************
>>> public class PivotSqlEntityProcessor extends SqlEntityProcessor {
>>>
>>> private static final Logger log = Logger
>>>       .getLogger(PivotSqlEntityProcessor.class.getName());
>>> private static final String ATTR_ID_DEFINITION = "columnIdDefinition";
>>> private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
>>> private static final String ATTR_COLUMN_VALUE = "columnValue";
>>> private static final String ATTR_COLUMN_SUBVALUE = "columnSubvalue";
>>> // default column names
>>> private static final String DEF_ATT_NAME = "ATT_NAME";
>>> private static final String DEF_VALUE = "VALUE";
>>> private static final String DEF_SUBVALUE = "SUBVALUE";
>>> private static final String DEF_ID_DEFINITION = "ID_DEFINITION";
>>> private String defId;
>>> private String colIdDef = DEF_ID_DEFINITION;
>>> private String colAttName = DEF_ATT_NAME;
>>> private String colValue = DEF_VALUE;
>>> private String colSubvalue = DEF_SUBVALUE;
>>>
>>>
>>> @SuppressWarnings("unchecked")
>>> public void init(Context context) {
>>>       super.init(context);
>>>       // load column names from entity element attributes
>>>       colIdDef = context.getEntityAttribute(ATTR_ID_DEFINITION);
>>>       defId = colIdDef;
>>>       colAttName = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME);
>>>       colValue = context.getEntityAttribute(ATTR_COLUMN_VALUE);
>>>       colSubvalue = context.getEntityAttribute(ATTR_COLUMN_SUBVALUE);
>>>       log.info("Using the column names: " +
>>>               colAttName + ", " + colValue + ", " + colSubvalue);
>>> }
>>>
>>>
>>> public Map<String, Object> nextRow() {
>>>       if (rowcache != null)
>>>               return getFromRowCache();
>>>       if (rowIterator == null) {
>>>               String q = getQuery();
>>>               initQuery(resolver.replaceTokens(q));
>>>       }
>>>       Map<String, Object> pivottedRow = new HashMap<String, Object>();
>>>       Map<String, Object> fieldRow = getNext();
>>>       String attName, value, subvalue;
>>>       Object existingValue, newValue;
>>>       Long id = null;
>>>       while (fieldRow != null) {
>>>               id = ((Number)fieldRow.get(colIdDef)).longValue();
>>>               assert id != null;
>>>               if (pivottedRow.containsKey(colIdDef)) {
>>>                       assert id.equals(pivottedRow.get(colIdDef));
>>>               } else {
>>>                       pivottedRow.put(colIdDef, id);
>>>               }
>>>               log.info("Next fieldRow: " + fieldRow);
>>>               attName = (String) fieldRow.get(colAttName);
>>>               if (attName == null) {
>>>                       log.warning("No value returned for attribute name
>>> column " + colAttName);
>>>               }
>>>               value = (String) fieldRow.get(colValue);
>>>               subvalue = (String) fieldRow.get(colSubvalue);
>>>
>>>               // create a single object for value and subvalue
>>>               // if subvalue is not set, use value only,
>>>               // otherwise create string array
>>>               if (subvalue == null || subvalue.trim().length() == 0) {
>>>                       newValue = value;
>>>               } else {
>>>                       newValue = new String[] { value, subvalue };
>>>               }
>>>
>>>               // if there is already an entry for that attribute,
>>>               // extend the existing value
>>>               if (pivottedRow.containsKey(attName)) {
>>>                       existingValue = pivottedRow.get(attName);
>>>                       if (existingValue instanceof List) {
>>>                               ((List) existingValue).add(newValue);
>>>                       } else {
>>>                               ArrayList v = new ArrayList();
>>>                               Collections.addAll(v, existingValue,
>>> newValue);
>>>                               pivottedRow.put(attName, v);
>>>                       }
>>>               } else {
>>>                       pivottedRow.put(attName, newValue);
>>>               }
>>>
>>>               fieldRow = getNext();
>>>       }
>>>       pivottedRow = applyTransformer(pivottedRow);
>>>       log.info("Returning: " + pivottedRow);
>>>       return pivottedRow;
>>> }
>>> }
>>>
>>> ****************** C O D E    E N D **********************************
>>>
>>>
>>> Noble Paul നോബിള്‍ नोब्ळ् schrieb:
>>>>
>>>> apparently there is something wrong with the code or it is incomplete
>>>>
>>>>  Map<String, Object> pivottedRow = new HashMap<String, Object>();
>>>>      Map<String, Object> fieldRow = getNext();
>>>>      while (fieldRow != null) {
>>>>              // populate pivottedRow
>>>>              fieldRow = getNext();
>>>>      }
>>>>      pivottedRow = applyTransformer(pivottedRow);
>>>>      log.info("Returning: " + pivottedRow);
>>>>      return pivottedRow;
>>>>
>>>>
>>>> here your 'pivottedRow' is empty and I don't see it getting filled;
>>>> fieldRow contains the data, but it is not used at all.
>>>>
>>>> On Thu, Jul 23, 2009 at 6:37 PM, Chantal
>>>> Ackermann<ch...@btelligent.de> wrote:
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> no, I didn't return the unique key, though there is one defined. I
>>>>> added
>>>>> that to the nextRow() implementation, and I am now returning it as part
>>>>> of
>>>>> the map.
>>>>>
>>>>> But it is still not creating any documents, and now that I can see the
>>>>> ID
>>>>> I
>>>>> have realized that it is always processing the same - the first - data
>>>>> set.
>>>>> It's like it tries to create the first document but does not, then
>>>>> reiterates over that same data, fails again, and so on. I mean, it
>>>>> doesn't
>>>>> even create one document. So it cannot be a simple iteration that
>>>>> updates
>>>>> the same document over and over again (as there is none).
>>>>>
>>>>> I haven't changed the log level. I see no error message in the output
>>>>> (catalina.log in my case).
>>>>>
>>>>> The complete entity definition:
>>>>>
>>>>> <dataConfig>
>>>>>  <dataSource type="JdbcDataSource"
>>>>> driver="oracle.jdbc.driver.OracleDriver" ... />
>>>>>  <document name="doc">
>>>>>      <entity name="epg_definition" pk="ID"
>>>>>              query="select ID from DEFINITION">
>>>>> <!-- originally I would set the field "id" (unique key) on this level,
>>>>> doesn't work neither -->
>>>>>          <entity name="value" pk="DEF_ID"
>>>>> processor="PivotSqlEntityProcessor"
>>>>>              query="select DEF_ID, id, name, value from datamart where
>>>>> parent_id=${id_definition.ID} and id in (1,23,17)">
>>>>>              <field column="DEF_ID" name="id" />
>>>>>              <field column="sid" name="sid" />
>>>>>              <field column="audio" name="audio" />
>>>>>              <field column="cat" name="cat" />
>>>>>          </entity>
>>>>>      </entity>
>>>>>  </document>
>>>>> </dataConfig>
>>>>>
>>>>> schema:
>>>>> <field name="id" type="long" indexed="true" stored="true"
>>>>> required="true"
>>>>> />
>>>>> <field name="sid" type="long" indexed="true" stored="true"
>>>>> required="true"
>>>>> />
>>>>> <field name="audio" type="text_ws" indexed="true" stored="false"
>>>>> omitNorms="true" multiValued="true"/>
>>>>> <field name="cat" type="text_ws" indexed="true" stored="true"
>>>>> omitNorms="true" multiValued="true"/>
>>>>>
>>>>> I am using more fields, but I removed them to make it easier to read. I
>>>>> am
>>>>> thinking about removing them from my test to be sure they don't
>>>>> interfere.
>>>>>
>>>>> Thanks for your help!
>>>>> Chantal
>>>>>
>>>>>
>>>>> Noble Paul നോബിള്‍ नोब्ळ् schrieb:
>>>>>>
>>>>>> Is there a <uniqueKey> in your schema ? are you returning a value
>>>>>> corresponding to that key name?
>>>>>>
>>>>>> probably you can paste the whole data-config.xml
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 23, 2009 at 4:59 PM, Chantal
>>>>>> Ackermann<ch...@btelligent.de> wrote:
>>>>>>>
>>>>>>> Hi Paul, hi Glen, hi all,
>>>>>>>
>>>>>>> thank you for your answers.
>>>>>>>
>>>>>>> I have followed Paul's solution (as I received it earlier). (I'll
>>>>>>> keep
>>>>>>> your
>>>>>>> suggestion in mind, though, Glen.)
>>>>>>>
>>>>>>> It looks good, except that it's not creating any documents... ;-)
>>>>>>> It is most probably some misunderstanding on my side, and maybe you
>>>>>>> can
>>>>>>> help
>>>>>>> me correct that?
>>>>>>>
>>>>>>> So, I have subclassed the SqlEntityProcessor by overwriting basically
>>>>>>> nextRow() as Paul suggested:
>>>>>>>
>>>>>>> public Map<String, Object> nextRow() {
>>>>>>>     if (rowcache != null)
>>>>>>>             return getFromRowCache();
>>>>>>>     if (rowIterator == null) {
>>>>>>>             String q = getQuery();
>>>>>>>             initQuery(resolver.replaceTokens(q));
>>>>>>>     }
>>>>>>>     Map<String, Object> pivottedRow = new HashMap<String, Object>();
>>>>>>>     Map<String, Object> fieldRow = getNext();
>>>>>>>     while (fieldRow != null) {
>>>>>>>             // populate pivottedRow
>>>>>>>             fieldRow = getNext();
>>>>>>>     }
>>>>>>>     pivottedRow = applyTransformer(pivottedRow);
>>>>>>>     log.info("Returning: " + pivottedRow);
>>>>>>>     return pivottedRow;
>>>>>>> }
>>>>>>>
>>>>>>> This seems to work as intended. From the log output, I can see that I
>>>>>>> get
>>>>>>> only the rows that I expect for one iteration in the correct
>>>>>>> key-value
>>>>>>> structure. I can also see, that the returned pivottedRow is what I
>>>>>>> want
>>>>>>> it
>>>>>>> to be: a map containing columns where each column contains what
>>>>>>> previously
>>>>>>> was input as a row.
>>>>>>>
>>>>>>> Example (shortened):
>>>>>>> INFO: Next fieldRow: {value=2, name=audio, id=1}
>>>>>>> INFO: Next fieldRow: {value=773, name=cat, id=23}
>>>>>>> INFO: Next fieldRow: {value=642058, name=sid, id=17}
>>>>>>>
>>>>>>> INFO: Returning: {sid=642058, cat=[773], audio=2}
>>>>>>>
>>>>>>> The entity declaration in the dih config (db_data_config.xml) looks
>>>>>>> like
>>>>>>> this (shortened):
>>>>>>> <entity name="my_value" processor="PivotSqlEntityProcessor"
>>>>>>>     columnValue="value" columnName="name"
>>>>>>>     query="select id, name, value from datamart where
>>>>>>> parent_id=${id_definition.ID} and id in (1,23,17)">
>>>>>>>     <field column="sid" name="sid" />
>>>>>>>     <field column="audio" name="audio" />
>>>>>>>     <field column="cat" name="cat" />
>>>>>>> </entity>
>>>>>>>
>>>>>>> id_definition is the root entity. Per parent_id there are several
>>>>>>> rows
>>>>>>> in
>>>>>>> the datamart table which describe one data set (=>lucene document).
>>>>>>>
>>>>>>> The object type of "value" is either String, String[] or List. I am
>>>>>>> not
>>>>>>> handling that explicitly, yet. If that'd be the problem it would
>>>>>>> throw
>>>>>>> an
>>>>>>> exception, wouldn't it?
>>>>>>>
>>>>>>> But it is not creating any documents at all, although the data seems
>>>>>>> to
>>>>>>> be
>>>>>>> returned correctly from the processor, so it's pobably something far
>>>>>>> more
>>>>>>> fundamental.
>>>>>>> <str name="Total Requests made to DataSource">1069</str>
>>>>>>> <str name="Total Rows Fetched">1069</str>
>>>>>>> <str name="Total Documents Skipped">0</str>
>>>>>>> <str name="Full Dump Started">2009-07-23 12:57:07</str>
>>>>>>> −
>>>>>>> <str name="">
>>>>>>> Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
>>>>>>> </str>
>>>>>>>
>>>>>>> Any help / hint on what the root cause is or how to debug it would be
>>>>>>> greatly appreciated.
>>>>>>>
>>>>>>> Thank you!
>>>>>>> Chantal
>>>>>>>
>>>>>>>
>>>>>>> Noble Paul നോബിള്‍ नोब्ळ् schrieb:
>>>>>>>>
>>>>>>>> alternately, you can write your own EntityProcessor and just
>>>>>>>> override
>>>>>>>> the nextRow() . I guess you can still use the JdbcDataSource
>>>>>>>>
>>>>>>>> On Wed, Jul 22, 2009 at 10:05 PM, Chantal
>>>>>>>> Ackermann<ch...@btelligent.de> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> this is my first post, as I am new to SOLR (some Lucene exp).
>>>>>>>>>
>>>>>>>>> I am trying to load data from an existing datamart into SOLR using
>>>>>>>>> the
>>>>>>>>> DataImportHandler but in my opinion it is too slow due to the
>>>>>>>>> special
>>>>>>>>> structure of the datamart I have to use.
>>>>>>>>>
>>>>>>>>> Root Cause:
>>>>>>>>> This datamart uses a row based approach (pivot) to present its
>>>>>>>>> data.
>>>>>>>>> It
>>>>>>>>> was
>>>>>>>>> so done to allow adding more attributes to the data set without
>>>>>>>>> having
>>>>>>>>> to
>>>>>>>>> change the table structure.
>>>>>>>>>
>>>>>>>>> Impact:
>>>>>>>>> To use the DataImportHandler, i have to pivot the data to create
>>>>>>>>> again
>>>>>>>>> one
>>>>>>>>> row per data set. Unfortunately, this results in more and less
>>>>>>>>> performant
>>>>>>>>> queries. Moreover, there are sometimes multiple rows for a single
>>>>>>>>> attribute,
>>>>>>>>> that require separate queries - or more tricky subselects that
>>>>>>>>> probably
>>>>>>>>> don't speed things up.
>>>>>>>>>
>>>>>>>>> Here is an example of the relation between DB requests, row fetches
>>>>>>>>> and
>>>>>>>>> actual number of documents created:
>>>>>>>>>
>>>>>>>>> <lst name="statusMessages">
>>>>>>>>> <str name="Total Requests made to DataSource">3737</str>
>>>>>>>>> <str name="Total Rows Fetched">5380</str>
>>>>>>>>> <str name="Total Documents Skipped">0</str>
>>>>>>>>> <str name="Full Dump Started">2009-07-22 18:19:06</str>
>>>>>>>>> -
>>>>>>>>> <str name="">
>>>>>>>>> Indexing completed. Added/Updated: 934 documents. Deleted 0
>>>>>>>>> documents.
>>>>>>>>> </str>
>>>>>>>>> <str name="Committed">2009-07-22 18:22:29</str>
>>>>>>>>> <str name="Optimized">2009-07-22 18:22:29</str>
>>>>>>>>> <str name="Time taken ">0:3:22.484</str>
>>>>>>>>> </lst>
>>>>>>>>>
>>>>>>>>> (Full index creation.)
>>>>>>>>> There are about half a million data sets, in total. That would
>>>>>>>>> require
>>>>>>>>> about
>>>>>>>>> 30h for indexing? My feeling is that there are far too many row
>>>>>>>>> fetches
>>>>>>>>> per
>>>>>>>>> data set.
>>>>>>>>>
>>>>>>>>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6
>>>>>>>>> using
>>>>>>>>> around 680MB RAM, Java6. I haven't changed the Lucene configuration
>>>>>>>>> (merge
>>>>>>>>> factor 10, ram buffer size 32).
>>>>>>>>>
>>>>>>>>> Possible solutions?
>>>>>>>>> A) Write my own DataImportHandler?
>>>>>>>>> B) Write my own "MultiRowTransformer" that accepts several rows as
>>>>>>>>> input
>>>>>>>>> argument (not sure this is a valid option)?
>>>>>>>>> C) Approach the DB developers to add a flat table with one data set
>>>>>>>>> per
>>>>>>>>> row?
>>>>>>>>> D) ...?
>>>>>>>>>
>>>>>>>>> If someone would like to share their experiences, that would be
>>>>>>>>> great!
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>> Chantal
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Chantal Ackermann
>>>>>>>>>
>>>>>>>> --
>>>>>>>> -----------------------------------------------------
>>>>>>>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>>>>>
>>>>>> --
>>>>>> -----------------------------------------------------
>>>>>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>>>>
>>>>
>>>>
>>>> --
>>>> -----------------------------------------------------
>>>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Paul,

no, I didn't return the unique key, though there is one defined. I added 
that to the nextRow() implementation, and I am now returning it as part 
of the map.

But it is still not creating any documents, and now that I can see the 
ID I have realized that it is always processing the same - the first - 
data set. It's like it tries to create the first document but does not, 
then reiterates over that same data, fails again, and so on. I mean, it 
doesn't even create one document. So it cannot be a simple iteration 
that updates the same document over and over again (as there is none).

I haven't changed the log level. I see no error message in the output 
(catalina.log in my case).

The complete entity definition:

<dataConfig>
     <dataSource type="JdbcDataSource" 
driver="oracle.jdbc.driver.OracleDriver" ... />
     <document name="doc">
         <entity name="epg_definition" pk="ID"
		query="select ID from DEFINITION">
<!-- originally I set the field "id" (unique key) on this level; 
that doesn't work either -->
             <entity name="value" pk="DEF_ID" 
processor="PivotSqlEntityProcessor"
		query="select DEF_ID, id, name, value from datamart where 
parent_id=${epg_definition.ID} and id in (1,23,17)">
		<field column="DEF_ID" name="id" />
		<field column="sid" name="sid" />
		<field column="audio" name="audio" />
		<field column="cat" name="cat" />
             </entity>
         </entity>
     </document>
</dataConfig>

schema:
<field name="id" type="long" indexed="true" stored="true" required="true" />
<field name="sid" type="long" indexed="true" stored="true" 
required="true" />
<field name="audio" type="text_ws" indexed="true" stored="false" 
omitNorms="true" multiValued="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true" 
omitNorms="true" multiValued="true"/>

I am using more fields, but I removed them to make it easier to read. I 
am thinking about removing them from my test to be sure they don't 
interfere.

Thanks for your help!
Chantal


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> Is there a <uniqueKey> in your schema ? are you returning a value
> corresponding to that key name?
> 
> probably you can paste the whole data-config.xml
> 
> 
> 
> On Thu, Jul 23, 2009 at 4:59 PM, Chantal
> Ackermann<ch...@btelligent.de> wrote:
>> Hi Paul, hi Glen, hi all,
>>
>> thank you for your answers.
>>
>> I have followed Paul's solution (as I received it earlier). (I'll keep your
>> suggestion in mind, though, Glen.)
>>
>> It looks good, except that it's not creating any documents... ;-)
>> It is most probably some misunderstanding on my side, and maybe you can help
>> me correct that?
>>
>> So, I have subclassed the SqlEntityProcessor by overwriting basically
>> nextRow() as Paul suggested:
>>
>> public Map<String, Object> nextRow() {
>>        if (rowcache != null)
>>                return getFromRowCache();
>>        if (rowIterator == null) {
>>                String q = getQuery();
>>                initQuery(resolver.replaceTokens(q));
>>        }
>>        Map<String, Object> pivottedRow = new HashMap<String, Object>();
>>        Map<String, Object> fieldRow = getNext();
>>        while (fieldRow != null) {
>>                // populate pivottedRow
>>                fieldRow = getNext();
>>        }
>>        pivottedRow = applyTransformer(pivottedRow);
>>        log.info("Returning: " + pivottedRow);
>>        return pivottedRow;
>> }
>>
>> This seems to work as intended. From the log output, I can see that I get
>> only the rows that I expect for one iteration in the correct key-value
>> structure. I can also see, that the returned pivottedRow is what I want it
>> to be: a map containing columns where each column contains what previously
>> was input as a row.
>>
>> Example (shortened):
>> INFO: Next fieldRow: {value=2, name=audio, id=1}
>> INFO: Next fieldRow: {value=773, name=cat, id=23}
>> INFO: Next fieldRow: {value=642058, name=sid, id=17}
>>
>> INFO: Returning: {sid=642058, cat=[773], audio=2}
>>
>> The entity declaration in the dih config (db_data_config.xml) looks like
>> this (shortened):
>> <entity name="my_value" processor="PivotSqlEntityProcessor"
>>        columnValue="value" columnName="name"
>>        query="select id, name, value from datamart where
>> parent_id=${id_definition.ID} and id in (1,23,17)">
>>        <field column="sid" name="sid" />
>>        <field column="audio" name="audio" />
>>        <field column="cat" name="cat" />
>> </entity>
>>
>> id_definition is the root entity. Per parent_id there are several rows in
>> the datamart table which describe one data set (=>lucene document).
>>
>> The object type of "value" is either String, String[] or List. I am not
>> handling that explicitly, yet. If that'd be the problem it would throw an
>> exception, wouldn't it?
>>
>> But it is not creating any documents at all, although the data seems to be
>> returned correctly from the processor, so it's pobably something far more
>> fundamental.
>> <str name="Total Requests made to DataSource">1069</str>
>> <str name="Total Rows Fetched">1069</str>
>> <str name="Total Documents Skipped">0</str>
>> <str name="Full Dump Started">2009-07-23 12:57:07</str>
>> −
>> <str name="">
>> Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
>> </str>
>>
>> Any help / hint on what the root cause is or how to debug it would be
>> greatly appreciated.
>>
>> Thank you!
>> Chantal
>>
>>
>> Noble Paul നോബിള്‍ नोब्ळ् schrieb:
>>> alternately, you can write your own EntityProcessor and just override
>>> the nextRow() . I guess you can still use the JdbcDataSource
>>>
>>> On Wed, Jul 22, 2009 at 10:05 PM, Chantal
>>> Ackermann<ch...@btelligent.de> wrote:
>>>> Hi all,
>>>>
>>>> this is my first post, as I am new to SOLR (some Lucene exp).
>>>>
>>>> I am trying to load data from an existing datamart into SOLR using the
>>>> DataImportHandler but in my opinion it is too slow due to the special
>>>> structure of the datamart I have to use.
>>>>
>>>> Root Cause:
>>>> This datamart uses a row based approach (pivot) to present its data. It
>>>> was
>>>> so done to allow adding more attributes to the data set without having to
>>>> change the table structure.
>>>>
>>>> Impact:
>>>> To use the DataImportHandler, i have to pivot the data to create again
>>>> one
>>>> row per data set. Unfortunately, this results in more and less performant
>>>> queries. Moreover, there are sometimes multiple rows for a single
>>>> attribute,
>>>> that require separate queries - or more tricky subselects that probably
>>>> don't speed things up.
>>>>
>>>> Here is an example of the relation between DB requests, row fetches and
>>>> actual number of documents created:
>>>>
>>>> <lst name="statusMessages">
>>>> <str name="Total Requests made to DataSource">3737</str>
>>>> <str name="Total Rows Fetched">5380</str>
>>>> <str name="Total Documents Skipped">0</str>
>>>> <str name="Full Dump Started">2009-07-22 18:19:06</str>
>>>> -
>>>> <str name="">
>>>> Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>>>> </str>
>>>> <str name="Committed">2009-07-22 18:22:29</str>
>>>> <str name="Optimized">2009-07-22 18:22:29</str>
>>>> <str name="Time taken ">0:3:22.484</str>
>>>> </lst>
>>>>
>>>> (Full index creation.)
>>>> There are about half a million data sets, in total. That would require
>>>> about
>>>> 30h for indexing? My feeling is that there are far too many row fetches
>>>> per
>>>> data set.
>>>>
>>>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>>>> around 680MB RAM, Java6. I haven't changed the Lucene configuration
>>>> (merge
>>>> factor 10, ram buffer size 32).
>>>>
>>>> Possible solutions?
>>>> A) Write my own DataImportHandler?
>>>> B) Write my own "MultiRowTransformer" that accepts several rows as input
>>>> argument (not sure this is a valid option)?
>>>> C) Approach the DB developers to add a flat table with one data set per
>>>> row?
>>>> D) ...?
>>>>
>>>> If someone would like to share their experiences, that would be great!
>>>>
>>>> Thanks a lot!
>>>> Chantal
>>>>
>>>>
>>>>
>>>> --
>>>> Chantal Ackermann
>>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------
>>> Noble Paul | Principal Engineer| AOL | http://aol.com
> 
> 
> 
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com

-- 
Chantal Ackermann
Consultant

mobil    +49 (176) 10 00 09 45
email    chantal.ackermann@btelligent.de

--------------------------------------------------------------------------------------------------------

b.telligent GmbH & Co. KG
Lichtenbergstraße 8
D-85748 Garching / München

fon       +49 (89) 54 84 25 60
fax        +49 (89) 54 84 25 69
web      www.btelligent.de

Registered in Munich: HRA 84393
Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented 
by Sebastian Amtage and Klaus Blaschek
USt.Id.-Nr. DE814054803



Confidentiality Note
This email is intended only for the use of the individual or entity to 
which it is addressed, and may contain information that is privileged, 
confidential and exempt from disclosure under applicable law. If the 
reader of this email message is not the intended recipient, or the 
employee or agent responsible for delivery of the message to the 
intended recipient, you are hereby notified that any dissemination, 
distribution or copying of this communication is prohibited. If you have 
received this email in error, please notify us immediately by telephone 
at +49 (0) 89 54 84 25 60. Thank you.

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
Is there a <uniqueKey> in your schema? Are you returning a value
corresponding to that key name?

Perhaps you could paste the whole data-config.xml.



On Thu, Jul 23, 2009 at 4:59 PM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> Hi Paul, hi Glen, hi all,
>
> thank you for your answers.
>
> I have followed Paul's solution (as I received it earlier). (I'll keep your
> suggestion in mind, though, Glen.)
>
> It looks good, except that it's not creating any documents... ;-)
> It is most probably some misunderstanding on my side, and maybe you can help
> me correct that?
>
> So, I have subclassed the SqlEntityProcessor by overwriting basically
> nextRow() as Paul suggested:
>
> public Map<String, Object> nextRow() {
>        if (rowcache != null)
>                return getFromRowCache();
>        if (rowIterator == null) {
>                String q = getQuery();
>                initQuery(resolver.replaceTokens(q));
>        }
>        Map<String, Object> pivottedRow = new HashMap<String, Object>();
>        Map<String, Object> fieldRow = getNext();
>        while (fieldRow != null) {
>                // populate pivottedRow
>                fieldRow = getNext();
>        }
>        pivottedRow = applyTransformer(pivottedRow);
>        log.info("Returning: " + pivottedRow);
>        return pivottedRow;
> }
>
> This seems to work as intended. From the log output, I can see that I get
> only the rows that I expect for one iteration in the correct key-value
> structure. I can also see, that the returned pivottedRow is what I want it
> to be: a map containing columns where each column contains what previously
> was input as a row.
>
> Example (shortened):
> INFO: Next fieldRow: {value=2, name=audio, id=1}
> INFO: Next fieldRow: {value=773, name=cat, id=23}
> INFO: Next fieldRow: {value=642058, name=sid, id=17}
>
> INFO: Returning: {sid=642058, cat=[773], audio=2}
>
> The entity declaration in the dih config (db_data_config.xml) looks like
> this (shortened):
> <entity name="my_value" processor="PivotSqlEntityProcessor"
>        columnValue="value" columnName="name"
>        query="select id, name, value from datamart where
> parent_id=${id_definition.ID} and id in (1,23,17)">
>        <field column="sid" name="sid" />
>        <field column="audio" name="audio" />
>        <field column="cat" name="cat" />
> </entity>
>
> id_definition is the root entity. Per parent_id there are several rows in
> the datamart table which describe one data set (=>lucene document).
>
> The object type of "value" is either String, String[] or List. I am not
> handling that explicitly, yet. If that'd be the problem it would throw an
> exception, wouldn't it?
>
> But it is not creating any documents at all, although the data seems to be
> returned correctly from the processor, so it's pobably something far more
> fundamental.
> <str name="Total Requests made to DataSource">1069</str>
> <str name="Total Rows Fetched">1069</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2009-07-23 12:57:07</str>
> −
> <str name="">
> Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
> </str>
>
> Any help / hint on what the root cause is or how to debug it would be
> greatly appreciated.
>
> Thank you!
> Chantal
>
>
> Noble Paul നോബിള്‍ नोब्ळ् schrieb:
>>
>> alternately, you can write your own EntityProcessor and just override
>> the nextRow() . I guess you can still use the JdbcDataSource
>>
>> On Wed, Jul 22, 2009 at 10:05 PM, Chantal
>> Ackermann<ch...@btelligent.de> wrote:
>>>
>>> Hi all,
>>>
>>> this is my first post, as I am new to SOLR (some Lucene exp).
>>>
>>> I am trying to load data from an existing datamart into SOLR using the
>>> DataImportHandler but in my opinion it is too slow due to the special
>>> structure of the datamart I have to use.
>>>
>>> Root Cause:
>>> This datamart uses a row based approach (pivot) to present its data. It
>>> was
>>> so done to allow adding more attributes to the data set without having to
>>> change the table structure.
>>>
>>> Impact:
>>> To use the DataImportHandler, i have to pivot the data to create again
>>> one
>>> row per data set. Unfortunately, this results in more and less performant
>>> queries. Moreover, there are sometimes multiple rows for a single
>>> attribute,
>>> that require separate queries - or more tricky subselects that probably
>>> don't speed things up.
>>>
>>> Here is an example of the relation between DB requests, row fetches and
>>> actual number of documents created:
>>>
>>> <lst name="statusMessages">
>>> <str name="Total Requests made to DataSource">3737</str>
>>> <str name="Total Rows Fetched">5380</str>
>>> <str name="Total Documents Skipped">0</str>
>>> <str name="Full Dump Started">2009-07-22 18:19:06</str>
>>> -
>>> <str name="">
>>> Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>>> </str>
>>> <str name="Committed">2009-07-22 18:22:29</str>
>>> <str name="Optimized">2009-07-22 18:22:29</str>
>>> <str name="Time taken ">0:3:22.484</str>
>>> </lst>
>>>
>>> (Full index creation.)
>>> There are about half a million data sets, in total. That would require
>>> about
>>> 30h for indexing? My feeling is that there are far too many row fetches
>>> per
>>> data set.
>>>
>>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>>> around 680MB RAM, Java6. I haven't changed the Lucene configuration
>>> (merge
>>> factor 10, ram buffer size 32).
>>>
>>> Possible solutions?
>>> A) Write my own DataImportHandler?
>>> B) Write my own "MultiRowTransformer" that accepts several rows as input
>>> argument (not sure this is a valid option)?
>>> C) Approach the DB developers to add a flat table with one data set per
>>> row?
>>> D) ...?
>>>
>>> If someone would like to share their experiences, that would be great!
>>>
>>> Thanks a lot!
>>> Chantal
>>>
>>>
>>>
>>> --
>>> Chantal Ackermann
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Paul, hi Glen, hi all,

thank you for your answers.

I have followed Paul's solution (as I received it earlier). (I'll keep 
your suggestion in mind, though, Glen.)

It looks good, except that it's not creating any documents... ;-)
It is most probably some misunderstanding on my side, and maybe you can 
help me correct that?

So, I have subclassed SqlEntityProcessor, basically just overriding 
nextRow() as Paul suggested:

public Map<String, Object> nextRow() {
	if (rowcache != null)
		return getFromRowCache();
	if (rowIterator == null) {
		String q = getQuery();
		initQuery(resolver.replaceTokens(q));
	}
	Map<String, Object> pivottedRow = new HashMap<String, Object>();
	Map<String, Object> fieldRow = getNext();
	while (fieldRow != null) {
		// populate pivottedRow
		fieldRow = getNext();
	}
	pivottedRow = applyTransformer(pivottedRow);
	log.info("Returning: " + pivottedRow);
	return pivottedRow;
}

This seems to work as intended. From the log output, I can see that I 
get only the rows that I expect for one iteration, in the correct 
key-value structure. I can also see that the returned pivottedRow is 
what I want it to be: a map of columns, where each column contains 
what was previously input as a row.

Example (shortened):
INFO: Next fieldRow: {value=2, name=audio, id=1}
INFO: Next fieldRow: {value=773, name=cat, id=23}
INFO: Next fieldRow: {value=642058, name=sid, id=17}

INFO: Returning: {sid=642058, cat=[773], audio=2}

The entity declaration in the dih config (db_data_config.xml) looks like 
this (shortened):
<entity name="my_value" processor="PivotSqlEntityProcessor"
	columnValue="value" columnName="name"
	query="select id, name, value from datamart where 
parent_id=${id_definition.ID} and id in (1,23,17)">
	<field column="sid" name="sid" />
	<field column="audio" name="audio" />
	<field column="cat" name="cat" />
</entity>

id_definition is the root entity. Per parent_id there are several rows 
in the datamart table which describe one data set (=>lucene document).

The object type of "value" is either String, String[] or List. I am not 
handling that explicitly, yet. If that'd be the problem it would throw 
an exception, wouldn't it?

But it is not creating any documents at all, although the data seems to 
be returned correctly from the processor, so it's probably something far 
more fundamental.
<str name="Total Requests made to DataSource">1069</str>
<str name="Total Rows Fetched">1069</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-23 12:57:07</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>

Any help / hint on what the root cause is or how to debug it would be 
greatly appreciated.

Thank you!
Chantal


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> alternately, you can write your own EntityProcessor and just override
> the nextRow() . I guess you can still use the JdbcDataSource
> 
> On Wed, Jul 22, 2009 at 10:05 PM, Chantal
> Ackermann<ch...@btelligent.de> wrote:
>> Hi all,
>>
>> this is my first post, as I am new to SOLR (some Lucene exp).
>>
>> I am trying to load data from an existing datamart into SOLR using the
>> DataImportHandler but in my opinion it is too slow due to the special
>> structure of the datamart I have to use.
>>
>> Root Cause:
>> This datamart uses a row based approach (pivot) to present its data. It was
>> so done to allow adding more attributes to the data set without having to
>> change the table structure.
>>
>> Impact:
>> To use the DataImportHandler, i have to pivot the data to create again one
>> row per data set. Unfortunately, this results in more and less performant
>> queries. Moreover, there are sometimes multiple rows for a single attribute,
>> that require separate queries - or more tricky subselects that probably
>> don't speed things up.
>>
>> Here is an example of the relation between DB requests, row fetches and
>> actual number of documents created:
>>
>> <lst name="statusMessages">
>> <str name="Total Requests made to DataSource">3737</str>
>> <str name="Total Rows Fetched">5380</str>
>> <str name="Total Documents Skipped">0</str>
>> <str name="Full Dump Started">2009-07-22 18:19:06</str>
>> -
>> <str name="">
>> Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>> </str>
>> <str name="Committed">2009-07-22 18:22:29</str>
>> <str name="Optimized">2009-07-22 18:22:29</str>
>> <str name="Time taken ">0:3:22.484</str>
>> </lst>
>>
>> (Full index creation.)
>> There are about half a million data sets, in total. That would require about
>> 30h for indexing? My feeling is that there are far too many row fetches per
>> data set.
>>
>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>> around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
>> factor 10, ram buffer size 32).
>>
>> Possible solutions?
>> A) Write my own DataImportHandler?
>> B) Write my own "MultiRowTransformer" that accepts several rows as input
>> argument (not sure this is a valid option)?
>> C) Approach the DB developers to add a flat table with one data set per row?
>> D) ...?
>>
>> If someone would like to share their experiences, that would be great!
>>
>> Thanks a lot!
>> Chantal
>>
>>
>>
>> --
>> Chantal Ackermann
>>
> 
> 
> 
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
Alternatively, you can write your own EntityProcessor and just override
nextRow(). I guess you can still use the JdbcDataSource.
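
Something along these lines, perhaps (only a bare sketch against the
SqlEntityProcessor internals, with "name"/"value" as placeholder column
names; remember that the returned map should also carry the value for
your schema's <uniqueKey> field):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.handler.dataimport.SqlEntityProcessor;

// Sketch only: collapses all rows returned by the entity query into one
// document map per parent row.
public class MultiRowEntityProcessor extends SqlEntityProcessor {

    public Map<String, Object> nextRow() {
        if (rowcache != null)
            return getFromRowCache();
        if (rowIterator == null)
            initQuery(resolver.replaceTokens(getQuery()));
        // Second call for this parent row: signal "no more documents".
        if (!rowIterator.hasNext()) {
            rowIterator = null;
            return null;
        }
        Map<String, Object> pivottedRow = new HashMap<String, Object>();
        do {
            Map<String, Object> fieldRow = rowIterator.next();
            // One attribute per source row becomes one column of the pivoted row.
            pivottedRow.put((String) fieldRow.get("name"), fieldRow.get("value"));
        } while (rowIterator.hasNext());
        return applyTransformer(pivottedRow);
    }
}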

On Wed, Jul 22, 2009 at 10:05 PM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> Hi all,
>
> this is my first post, as I am new to SOLR (some Lucene exp).
>
> I am trying to load data from an existing datamart into SOLR using the
> DataImportHandler but in my opinion it is too slow due to the special
> structure of the datamart I have to use.
>
> Root Cause:
> This datamart uses a row based approach (pivot) to present its data. It was
> so done to allow adding more attributes to the data set without having to
> change the table structure.
>
> Impact:
> To use the DataImportHandler, i have to pivot the data to create again one
> row per data set. Unfortunately, this results in more and less performant
> queries. Moreover, there are sometimes multiple rows for a single attribute,
> that require separate queries - or more tricky subselects that probably
> don't speed things up.
>
> Here is an example of the relation between DB requests, row fetches and
> actual number of documents created:
>
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">3737</str>
> <str name="Total Rows Fetched">5380</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2009-07-22 18:19:06</str>
> -
> <str name="">
> Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
> </str>
> <str name="Committed">2009-07-22 18:22:29</str>
> <str name="Optimized">2009-07-22 18:22:29</str>
> <str name="Time taken ">0:3:22.484</str>
> </lst>
>
> (Full index creation.)
> There are about half a million data sets, in total. That would require about
> 30h for indexing? My feeling is that there are far too many row fetches per
> data set.
>
> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
> around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
> factor 10, ram buffer size 32).
>
> Possible solutions?
> A) Write my own DataImportHandler?
> B) Write my own "MultiRowTransformer" that accepts several rows as input
> argument (not sure this is a valid option)?
> C) Approach the DB developers to add a flat table with one data set per row?
> D) ...?
>
> If someone would like to share their experiences, that would be great!
>
> Thanks a lot!
> Chantal
>
>
>
> --
> Chantal Ackermann
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com