Posted to solr-user@lucene.apache.org by Rupert Fiasco <ru...@gmail.com> on 2009/09/09 01:05:31 UTC

Specifying multiple documents in DataImportHandler dataConfig

I am using the DataImportHandler with a JDBC datasource. From my
understanding of DIH, for each of my "content types" (e.g. Blog posts,
Mesh Categories, etc.) I would construct a series of document/entity
sets, like

<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />

    <!-- BLOG ENTRIES -->
    <document name="blog_entries">
      <entity name="blog_entries" query="select
id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
from blog_entries">
        <field column="id" name="pk_i" />
        <field column="id" name="id" />
        <field column="title" name="text_t" />
        <field column="data" name="text_t" />
      </entity>
    </document>

    <!-- MESH CATEGORIES -->
    <document name="mesh_category">
      <entity name="mesh_categories" query="select
id,name,node_key,name as name_fc,'MeshCategory' as type from
mesh_categories">
        <field column="id" name="pk_i" />
        <field column="id" name="id" />
        <field column="name" name="text_t" />
        <field column="node_key" name="string" />
        <field column="name_fc" name="facet_value" />
        <field column="type" name="type_t" />
      </entity>
    </document>
</dataConfig>


Solr parses this just fine and allows me to issue a
/dataimport?command=full-import, and it runs, but only against
the "first" document (blog_entries). It doesn't run against the 2nd
document (mesh_categories).

If I remove the 2 document elements and wrap both entity sets in just
one document tag, then both sets get indexed, which seemingly achieves
my goal. This just doesn't make sense from my understanding of how DIH
works. My 2 content types are indeed separate, so they logically
represent two document types, not one.

Is this correct? What am I missing here?

Thanks
-Rupert

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Bertie Shen <be...@gmail.com>.

Hi Lance,

I think you are discussing a different issue here. We are talking about
the case where each row of each table represents one document in the index.
You seem to be discussing documents with multi-valued fields that are
stored in a separate table in the RDBMS because of normalization.



On Mon, Nov 9, 2009 at 6:01 PM, Lance Norskog <go...@gmail.com> wrote:

> There is a more fundamental problem here: a Solr/Lucene index
> implements only one table. If you have data from multiple tables in a
> normalized database, you have to denormalize the multi-table DB schema to
> make a single-table Solr/Lucene index.
>
> Your indexing will probably be faster if you use a join in SQL to supply
> your entire set of fields per database request.

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Lance Norskog <go...@gmail.com>.

There is a more fundamental problem here: a Solr/Lucene index
implements only one table. If you have data from multiple tables in a
normalized database, you have to denormalize the multi-table DB schema to
make a single-table Solr/Lucene index.

Your indexing will probably be faster if you use a join in SQL to supply
your entire set of fields per database request.
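
To illustrate the denormalization idea, here is a minimal sketch of an
entity whose query joins a second table, so that each row already carries
every field of its Solr document. The authors table, the author_id column
and the author_t field are hypothetical and do not appear in this thread;
the elided JDBC URL is left as posted.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
  <document>
    <!-- Hypothetical: authors, author_id and author_t are invented for
         illustration. One SQL request returns the flattened row. -->
    <entity name="blog_entries"
            query="select b.id, b.title, a.name as author
                   from blog_entries b
                   join authors a on a.id = b.author_id">
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="author" name="author_t" />
    </entity>
  </document>
</dataConfig>
```

This only applies when several tables feed one kind of document. It does
not address importing unrelated tables as separate sets of documents.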


-- 
Lance Norskog
goksron@gmail.com

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <be...@gmail.com> wrote:
> I have figured out a way to solve this problem: just specify a
> single <document> blah blah blah </document>. Under <document>, specify
> multiple top level entity entries, each of which corresponds to one table
> data.
>
> So each top level entry will map one row in it to a document in Lucene
> index. <document> in DIH is *NOT* mapped to a document in Lucene index while
> top-level entity is. I feel <document> tag is redundant and misleading in
> data config and thus should be removed.

There are some common attributes specified at the <document> level.
It still acts as a container tag.

-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Bertie Shen <be...@gmail.com>.

I have figured out a way to solve this problem: just specify a
single <document> blah blah blah </document>. Under <document>, specify
multiple top-level entity entries, each of which corresponds to the data
of one table.

Each top-level entity then maps every row it returns to one document in the
Lucene index. <document> in DIH is *NOT* mapped to a document in the Lucene
index, while a top-level entity is. I feel the <document> tag is redundant
and misleading in the data config and thus should be removed.
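
As a rough sketch of the layout described above, here is Rupert's
configuration with both entities under a single <document>. The queries and
field mappings are copied from the original post and the elided JDBC URL is
left as posted. The stray </datasource> closing tag is dropped on the
assumption that it was a typo.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />

  <!-- One <document> container; each root entity below yields one
       Solr document per row returned by its query. -->
  <document>
    <!-- BLOG ENTRIES -->
    <entity name="blog_entries"
            query="select id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type from blog_entries">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
    </entity>

    <!-- MESH CATEGORIES -->
    <entity name="mesh_categories"
            query="select id,name,node_key,name as name_fc,'MeshCategory' as type from mesh_categories">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="node_key" name="string" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>
  </document>
</dataConfig>
```

One thing to check, since the thread does not show the schema: if id is the
uniqueKey and the two tables share id values, rows from one table would
overwrite rows from the other, so a key that also encodes the type may be
needed.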

Cheers.


Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Bertie Shen <be...@gmail.com>.

I have the same problem. I had thought we could specify multiple <document>
blah blah blah</document>s, each of which maps one table in the RDBMS.
But I found that is not the case. DIH only picks the first <document>blah blah
blah</document> for indexing.

I think Rupert's and my requests are pretty common. Basically, there are
multiple tables in the RDBMS, and we want each row in each table to become a
document in the Lucene index. How can we write one data config XML file that
lets DataImportHandler import multiple tables at the same time?

Rupert, have you figured out a way to do it?

Thanks.



Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Fergus McMenemie <fe...@twig.me.uk>.

You can only have one document tag and the entities must be nested
within that.

From the wiki: if you issue a simple "/dataimport?command=full-import",
all top-level entities will be processed.



-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

DIH allows only one <document> tag. You may have multiple root <entity>
tags, and you may invoke them by name(s). When no name is passed, all
root entities are invoked one after another.
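
A sketch of what invoking root entities by name might look like, with the
request written as a comment above each entity. The entity request parameter
is the one described on the DIH wiki. The entity names come from the original
post, and the queries and field lists are shortened here for illustration.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
  <document>
    <!-- Runs alone with: /dataimport?command=full-import&entity=blog_entries -->
    <entity name="blog_entries" query="select id,title,data from blog_entries">
      <field column="id" name="id" />
      <field column="title" name="text_t" />
    </entity>

    <!-- Runs alone with: /dataimport?command=full-import&entity=mesh_categories -->
    <entity name="mesh_categories" query="select id,name from mesh_categories">
      <field column="id" name="id" />
      <field column="name" name="text_t" />
    </entity>

    <!-- With no entity parameter, /dataimport?command=full-import runs
         both root entities one after another. -->
  </document>
</dataConfig>
```

Note that full-import cleans the index by default (clean=true), so a
per-entity run may need clean=false to keep the other entity's documents.
This is worth verifying against your Solr version.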


-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Specifying multiple documents in DataImportHandler dataConfig

Posted by Rupert Fiasco <ru...@gmail.com>.

Maybe I should be more clear: I have multiple tables in my DB that I
need to save to my Solr index. In my app code I have logic to persist
each table, which maps to an application model, to Solr. This is fine.
I am just trying to speed up indexing time by using DIH instead of
going through my application. From what I understand of DIH, I can
specify one dataSource element and then a series of document/entity
sets, one for each of my models. But like I said before, DIH only appears
to want to index the first document declared under the dataSource tag.

-Rupert
