Posted to solr-user@lucene.apache.org by Bertie Shen <be...@gmail.com> on 2009/11/07 18:43:50 UTC

Re: Specifying multiple documents in DataImportHandler dataConfig

I have the same problem. I had thought we could specify multiple <document>
blah blah blah</document> blocks, each of which maps to one table in the RDBMS.
But I found that is not the case: it only picks the first <document>blah blah
blah</document> for indexing.

I think Rupert's and my request are pretty common. Basically there are
multiple tables in an RDBMS, and we want each row in each table to become a
document in the Lucene index. How can we write one data-config.xml file to let
DataImportHandler import multiple tables at the same time?

Rupert, have you figured out a way to do it?

Thanks.


On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <ru...@gmail.com> wrote:

> Maybe I should be more clear: I have multiple tables in my DB that I
> need to save to my Solr index. In my app code I have logic to persist
> each table, mapping each application model to Solr. This is fine.
> I am just trying to speed up indexing time by using DIH instead of
> going through my application. From what I understand of DIH, I can
> specify one dataSource element and then a series of document/entity
> sets, one for each of my models. But like I said before, DIH only appears
> to want to index the first document declared under the dataSource tag.
>
> -Rupert
>
> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<ru...@gmail.com> wrote:
> > I am using the DataImportHandler with a JDBC datasource. From my
> > understanding of DIH, for each of my "content types", e.g. Blog posts,
> > Mesh Categories, etc., I would construct a series of document/entity
> > sets, like this:
> >
> > <dataConfig>
> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
> >
> >    <!-- BLOG ENTRIES -->
> >    <document name="blog_entries">
> >      <entity name="blog_entries" query="select
> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
> > from blog_entries">
> >        <field column="id" name="pk_i" />
> >        <field column="id" name="id" />
> >        <field column="title" name="text_t" />
> >        <field column="data" name="text_t" />
> >      </entity>
> >    </document>
> >
> >    <!-- MESH CATEGORIES -->
> >    <document name="mesh_category">
> >      <entity name="mesh_categories" query="select
> > id,name,node_key,name as name_fc,'MeshCategory' as type from
> > mesh_categories">
> >        <field column="id" name="pk_i" />
> >        <field column="id" name="id" />
> >        <field column="name" name="text_t" />
> >        <field column="node_key" name="string" />
> >        <field column="name_fc" name="facet_value" />
> >        <field column="type" name="type_t" />
> >      </entity>
> >    </document>
> > </dataConfig>
> >
> >
> > Solr parses this just fine and allows me to issue a
> > /dataimport?command=full-import and it runs, but it only runs against
> > the "first" document (blog_entries). It doesn't run against the second
> > document (mesh_categories).
> >
> > If I remove the two document elements and wrap both entity sets in just
> > one document tag, then both sets get indexed, which seemingly achieves
> > my goal. This just doesn't make sense from my understanding of how DIH
> > works. My two content types are indeed separate, so they logically
> > represent two document types, not one.
> >
> > Is this correct? What am I missing here?
> >
> > Thanks
> > -Rupert
> >
>

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Bertie Shen <be...@gmail.com>.
Hi Lance,

 I think you are discussing a different issue here. We are talking about
making each row from each table a document in the index. You seem to be
discussing the case where some documents have multi-valued fields stored in a
separate table in the RDBMS because of normalization.
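For reference, a rough sketch of that case in DIH would be a child entity
nested under the parent entity (the table, column, and Solr field names below
are hypothetical):

<document>
  <entity name="blog_entry" query="select id, title from blog_entries">
    <field column="id" name="id" />
    <field column="title" name="title_t" />
    <!-- child entity: each parent row may match many tag rows, and the
         matches become a multi-valued field on the same Solr document -->
    <entity name="tags"
            query="select tag from blog_entry_tags
                   where blog_entry_id = '${blog_entry.id}'">
      <field column="tag" name="tag_ss" />
    </entity>
  </entity>
</document>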



On Mon, Nov 9, 2009 at 6:01 PM, Lance Norskog <go...@gmail.com> wrote:

> There is a more fundamental problem here: a Solr/Lucene index only
> implements one table. If you have data from multiple normalized tables,
> you have to denormalize the multi-table DB schema to
> make a single-table Solr/Lucene index.
>
> Your indexing will probably be faster if you use a join in SQL to supply
> your entire set of fields per database request.
>
> 2009/11/7 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> > On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <be...@gmail.com>
> wrote:
> >> I have figured out a way to solve this problem: just specify a
> >> single <document> blah blah blah </document>. Under <document>, specify
> >> multiple top-level entity entries, each of which corresponds to one
> >> table's data.
> >>
> >> So each top-level entity will map each of its rows to a document in the
> >> Lucene index. <document> in DIH is *NOT* mapped to a document in the
> >> Lucene index, while a top-level entity is. I feel the <document> tag is
> >> redundant and misleading in the data config and thus should be removed.
> >
> > There are some common attributes specified at the <document> level.
> > It still acts as a container tag.
> >>
> >> Cheers.
> >>
> >> On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <be...@gmail.com>
> wrote:
> >>
> >>> I have the same problem. I had thought we could specify multiple
> <document>
> >>> blah blah blah</document>s, each of which is mapping one table in the
> RDBMS.
> >>> But I found it was not the case. It only picks the first <document>blah
> blah
> >>> blah</document> to do indexing.
> >>>
> >>> I think Rupert's  and my request are pretty common. Basically there are
> >>> multiple tables in RDBMS, and we want each row in each table become a
> >>> document in Lucene index. How can we write one data config.xml file to
> let
> >>> DataImportHandler import multiple tables at the same time?
> >>>
> >>> Rupert, have you figured out a way to do it?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>>
> >>> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <ru...@gmail.com>
> wrote:
> >>>
> >>>> Maybe I should be more clear: I have multiple tables in my DB that I
> >>>> need to save to my Solr index. In my app code I have logic to persist
> >>>> each table, which maps to an application model to Solr. This is fine.
> >>>> I am just trying to speed up indexing time by using DIH instead of
> >>>> going through my application. From what I understand of DIH I can
> >>>> specify one dataSource element and then a series of document/entity
> >>>> sets, for each of my models. But like I said before, DIH only appears
> >>>> to want to index the first document declared under the dataSource tag.
> >>>>
> >>>> -Rupert
> >>>>
> >>>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<ru...@gmail.com>
> wrote:
> >>>> > I am using the DataImportHandler with a JDBC datasource. From my
> >>>> > understanding of DIH, for each of my "content types" e.g. Blog
> posts,
> >>>> > Mesh Categories, etc I would construct a series of document/entity
> >>>> > sets, like
> >>>> >
> >>>> > <dataConfig>
> >>>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...."
> />
> >>>> >
> >>>> >    <!-- BLOG ENTRIES -->
> >>>> >    <document name="blog_entries">
> >>>> >      <entity name="blog_entries" query="select
> >>>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
> >>>> > from blog_entries">
> >>>> >        <field column="id" name="pk_i" />
> >>>> >        <field column="id" name="id" />
> >>>> >        <field column="title" name="text_t" />
> >>>> >        <field column="data" name="text_t" />
> >>>> >      </entity>
> >>>> >    </document>
> >>>> >
> >>>> >    <!-- MESH CATEGORIES -->
> >>>> >    <document name="mesh_category">
> >>>> >      <entity name="mesh_categories" query="select
> >>>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
> >>>> > mesh_categories">
> >>>> >        <field column="id" name="pk_i" />
> >>>> >        <field column="id" name="id" />
> >>>> >        <field column="name" name="text_t" />
> >>>> >        <field column="node_key" name="string" />
> >>>> >        <field column="name_fc" name="facet_value" />
> >>>> >        <field column="type" name="type_t" />
> >>>> >      </entity>
> >>>> >    </document>
> >>>> > </datasource>
> >>>> > </dataConfig>
> >>>> >
> >>>> >
> >>>> > Solr parses this just fine and allows me to issue a
> >>>> > /dataimport?command=full-import and it runs, but it only runs
> against
> >>>> > the "first" document (blog_entries). It doesnt run against the 2nd
> >>>> > document (mesh_categories).
> >>>> >
> >>>> > If I remove the 2 document elements and wrap both entity sets in
> just
> >>>> > one document tag, then both sets get indexed, which seemingly
> achieves
> >>>> > my goal. This just doesnt make sense from my understanding of how
> DIH
> >>>> > works. My 2 content types are indeed separate so they logically
> >>>> > represent two document types, not one.
> >>>> >
> >>>> > Is this correct? What am I missing here?
> >>>> >
> >>>> > Thanks
> >>>> > -Rupert
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > -----------------------------------------------------
> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Lance Norskog <go...@gmail.com>.
There is a more fundamental problem here: a Solr/Lucene index only
implements one table. If you have data from multiple normalized tables,
you have to denormalize the multi-table DB schema to
make a single-table Solr/Lucene index.

Your indexing will probably be faster if you use a join in SQL to supply
your entire set of fields per database request.
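
For instance, a minimal sketch of such a denormalizing join inside a single
DIH entity (the authors table and the author_name field are made up for
illustration):

<entity name="blog_entries"
        query="select b.id, b.title, b.data, a.name as author_name
               from blog_entries b
               join authors a on a.id = b.author_id">
  <field column="id" name="id" />
  <field column="title" name="text_t" />
  <field column="data" name="text_t" />
  <field column="author_name" name="text_t" />
</entity>

One SQL request per row supplies the whole set of fields, instead of extra
per-row queries from nested entities.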

2009/11/7 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <be...@gmail.com> wrote:
>> I have figured out a way to solve this problem: just specify a
>> single <document> blah blah blah </document>. Under <document>, specify
>> multiple top level entity entries, each of which corresponds to one table
>> data.
>>
>> So each top level entry will map one row in it to a document in Lucene
>> index. <document> in DIH is *NOT* mapped to a document in Lucene index while
>> top-level entity is. I feel <document> tag is redundant and misleading in
>> data config and thus should be removed.
>
> There are some common attributes specified at the <document> level .
> It still acts as a container tag .
>>
>> Cheers.
>>
>> On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <be...@gmail.com> wrote:
>>
>>> I have the same problem. I had thought we could specify multiple <document>
>>> blah blah blah</document>s, each of which is mapping one table in the RDBMS.
>>> But I found it was not the case. It only picks the first <document>blah blah
>>> blah</document> to do indexing.
>>>
>>> I think Rupert's  and my request are pretty common. Basically there are
>>> multiple tables in RDBMS, and we want each row in each table become a
>>> document in Lucene index. How can we write one data config.xml file to let
>>> DataImportHandler import multiple tables at the same time?
>>>
>>> Rupert, have you figured out a way to do it?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <ru...@gmail.com> wrote:
>>>
>>>> Maybe I should be more clear: I have multiple tables in my DB that I
>>>> need to save to my Solr index. In my app code I have logic to persist
>>>> each table, which maps to an application model to Solr. This is fine.
>>>> I am just trying to speed up indexing time by using DIH instead of
>>>> going through my application. From what I understand of DIH I can
>>>> specify one dataSource element and then a series of document/entity
>>>> sets, for each of my models. But like I said before, DIH only appears
>>>> to want to index the first document declared under the dataSource tag.
>>>>
>>>> -Rupert
>>>>
>>>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>>>> > I am using the DataImportHandler with a JDBC datasource. From my
>>>> > understanding of DIH, for each of my "content types" e.g. Blog posts,
>>>> > Mesh Categories, etc I would construct a series of document/entity
>>>> > sets, like
>>>> >
>>>> > <dataConfig>
>>>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>>>> >
>>>> >    <!-- BLOG ENTRIES -->
>>>> >    <document name="blog_entries">
>>>> >      <entity name="blog_entries" query="select
>>>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>>>> > from blog_entries">
>>>> >        <field column="id" name="pk_i" />
>>>> >        <field column="id" name="id" />
>>>> >        <field column="title" name="text_t" />
>>>> >        <field column="data" name="text_t" />
>>>> >      </entity>
>>>> >    </document>
>>>> >
>>>> >    <!-- MESH CATEGORIES -->
>>>> >    <document name="mesh_category">
>>>> >      <entity name="mesh_categories" query="select
>>>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
>>>> > mesh_categories">
>>>> >        <field column="id" name="pk_i" />
>>>> >        <field column="id" name="id" />
>>>> >        <field column="name" name="text_t" />
>>>> >        <field column="node_key" name="string" />
>>>> >        <field column="name_fc" name="facet_value" />
>>>> >        <field column="type" name="type_t" />
>>>> >      </entity>
>>>> >    </document>
>>>> > </datasource>
>>>> > </dataConfig>
>>>> >
>>>> >
>>>> > Solr parses this just fine and allows me to issue a
>>>> > /dataimport?command=full-import and it runs, but it only runs against
>>>> > the "first" document (blog_entries). It doesnt run against the 2nd
>>>> > document (mesh_categories).
>>>> >
>>>> > If I remove the 2 document elements and wrap both entity sets in just
>>>> > one document tag, then both sets get indexed, which seemingly achieves
>>>> > my goal. This just doesnt make sense from my understanding of how DIH
>>>> > works. My 2 content types are indeed separate so they logically
>>>> > represent two document types, not one.
>>>> >
>>>> > Is this correct? What am I missing here?
>>>> >
>>>> > Thanks
>>>> > -Rupert
>>>> >
>>>>
>>>
>>>
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
Lance Norskog
goksron@gmail.com

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <be...@gmail.com> wrote:
> I have figured out a way to solve this problem: just specify a
> single <document> blah blah blah </document>. Under <document>, specify
> multiple top level entity entries, each of which corresponds to one table
> data.
>
> So each top level entry will map one row in it to a document in Lucene
> index. <document> in DIH is *NOT* mapped to a document in Lucene index while
> top-level entity is. I feel <document> tag is redundant and misleading in
> data config and thus should be removed.

There are some common attributes specified at the <document> level.
It still acts as a container tag.
>
> Cheers.
>
> On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <be...@gmail.com> wrote:
>
>> I have the same problem. I had thought we could specify multiple <document>
>> blah blah blah</document>s, each of which is mapping one table in the RDBMS.
>> But I found it was not the case. It only picks the first <document>blah blah
>> blah</document> to do indexing.
>>
>> I think Rupert's  and my request are pretty common. Basically there are
>> multiple tables in RDBMS, and we want each row in each table become a
>> document in Lucene index. How can we write one data config.xml file to let
>> DataImportHandler import multiple tables at the same time?
>>
>> Rupert, have you figured out a way to do it?
>>
>> Thanks.
>>
>>
>>
>> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <ru...@gmail.com> wrote:
>>
>>> Maybe I should be more clear: I have multiple tables in my DB that I
>>> need to save to my Solr index. In my app code I have logic to persist
>>> each table, which maps to an application model to Solr. This is fine.
>>> I am just trying to speed up indexing time by using DIH instead of
>>> going through my application. From what I understand of DIH I can
>>> specify one dataSource element and then a series of document/entity
>>> sets, for each of my models. But like I said before, DIH only appears
>>> to want to index the first document declared under the dataSource tag.
>>>
>>> -Rupert
>>>
>>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>>> > I am using the DataImportHandler with a JDBC datasource. From my
>>> > understanding of DIH, for each of my "content types" e.g. Blog posts,
>>> > Mesh Categories, etc I would construct a series of document/entity
>>> > sets, like
>>> >
>>> > <dataConfig>
>>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>>> >
>>> >    <!-- BLOG ENTRIES -->
>>> >    <document name="blog_entries">
>>> >      <entity name="blog_entries" query="select
>>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>>> > from blog_entries">
>>> >        <field column="id" name="pk_i" />
>>> >        <field column="id" name="id" />
>>> >        <field column="title" name="text_t" />
>>> >        <field column="data" name="text_t" />
>>> >      </entity>
>>> >    </document>
>>> >
>>> >    <!-- MESH CATEGORIES -->
>>> >    <document name="mesh_category">
>>> >      <entity name="mesh_categories" query="select
>>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
>>> > mesh_categories">
>>> >        <field column="id" name="pk_i" />
>>> >        <field column="id" name="id" />
>>> >        <field column="name" name="text_t" />
>>> >        <field column="node_key" name="string" />
>>> >        <field column="name_fc" name="facet_value" />
>>> >        <field column="type" name="type_t" />
>>> >      </entity>
>>> >    </document>
>>> > </datasource>
>>> > </dataConfig>
>>> >
>>> >
>>> > Solr parses this just fine and allows me to issue a
>>> > /dataimport?command=full-import and it runs, but it only runs against
>>> > the "first" document (blog_entries). It doesnt run against the 2nd
>>> > document (mesh_categories).
>>> >
>>> > If I remove the 2 document elements and wrap both entity sets in just
>>> > one document tag, then both sets get indexed, which seemingly achieves
>>> > my goal. This just doesnt make sense from my understanding of how DIH
>>> > works. My 2 content types are indeed separate so they logically
>>> > represent two document types, not one.
>>> >
>>> > Is this correct? What am I missing here?
>>> >
>>> > Thanks
>>> > -Rupert
>>> >
>>>
>>
>>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Bertie Shen <be...@gmail.com>.
I have figured out a way to solve this problem: just specify a
single <document> blah blah blah </document>. Under <document>, specify
multiple top-level entity entries, each of which corresponds to one table's
data.

So each top-level entity will map each of its rows to a document in the
Lucene index. <document> in DIH is *NOT* mapped to a document in the Lucene
index, while a top-level entity is. I feel the <document> tag is redundant and
misleading in the data config and thus should be removed.
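
In other words, Rupert's config works if both entities sit under a single
<document>; a sketch based on his original config (queries and field mappings
unchanged):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
  <document>
    <!-- BLOG ENTRIES: each row becomes one Solr document -->
    <entity name="blog_entries" query="select
        id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
        from blog_entries">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
    </entity>
    <!-- MESH CATEGORIES: each row also becomes one Solr document -->
    <entity name="mesh_categories" query="select
        id,name,node_key,name as name_fc,'MeshCategory' as type from
        mesh_categories">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="node_key" name="string" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>
  </document>
</dataConfig>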

Cheers.

On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <be...@gmail.com> wrote:

> I have the same problem. I had thought we could specify multiple <document>
> blah blah blah</document>s, each of which is mapping one table in the RDBMS.
> But I found it was not the case. It only picks the first <document>blah blah
> blah</document> to do indexing.
>
> I think Rupert's  and my request are pretty common. Basically there are
> multiple tables in RDBMS, and we want each row in each table become a
> document in Lucene index. How can we write one data config.xml file to let
> DataImportHandler import multiple tables at the same time?
>
> Rupert, have you figured out a way to do it?
>
> Thanks.
>
>
>
> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <ru...@gmail.com> wrote:
>
>> Maybe I should be more clear: I have multiple tables in my DB that I
>> need to save to my Solr index. In my app code I have logic to persist
>> each table, which maps to an application model to Solr. This is fine.
>> I am just trying to speed up indexing time by using DIH instead of
>> going through my application. From what I understand of DIH I can
>> specify one dataSource element and then a series of document/entity
>> sets, for each of my models. But like I said before, DIH only appears
>> to want to index the first document declared under the dataSource tag.
>>
>> -Rupert
>>
>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>> > I am using the DataImportHandler with a JDBC datasource. From my
>> > understanding of DIH, for each of my "content types" e.g. Blog posts,
>> > Mesh Categories, etc I would construct a series of document/entity
>> > sets, like
>> >
>> > <dataConfig>
>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>> >
>> >    <!-- BLOG ENTRIES -->
>> >    <document name="blog_entries">
>> >      <entity name="blog_entries" query="select
>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>> > from blog_entries">
>> >        <field column="id" name="pk_i" />
>> >        <field column="id" name="id" />
>> >        <field column="title" name="text_t" />
>> >        <field column="data" name="text_t" />
>> >      </entity>
>> >    </document>
>> >
>> >    <!-- MESH CATEGORIES -->
>> >    <document name="mesh_category">
>> >      <entity name="mesh_categories" query="select
>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
>> > mesh_categories">
>> >        <field column="id" name="pk_i" />
>> >        <field column="id" name="id" />
>> >        <field column="name" name="text_t" />
>> >        <field column="node_key" name="string" />
>> >        <field column="name_fc" name="facet_value" />
>> >        <field column="type" name="type_t" />
>> >      </entity>
>> >    </document>
>> > </datasource>
>> > </dataConfig>
>> >
>> >
>> > Solr parses this just fine and allows me to issue a
>> > /dataimport?command=full-import and it runs, but it only runs against
>> > the "first" document (blog_entries). It doesnt run against the 2nd
>> > document (mesh_categories).
>> >
>> > If I remove the 2 document elements and wrap both entity sets in just
>> > one document tag, then both sets get indexed, which seemingly achieves
>> > my goal. This just doesnt make sense from my understanding of how DIH
>> > works. My 2 content types are indeed separate so they logically
>> > represent two document types, not one.
>> >
>> > Is this correct? What am I missing here?
>> >
>> > Thanks
>> > -Rupert
>> >
>>
>
>