Posted to solr-user@lucene.apache.org by venn hardy <ve...@hotmail.com> on 2009/09/10 04:00:09 UTC

Extract info from parent node during data import

Hello, 

 

I am using SOLR 1.4 (from a nightly build) and its URLDataSource in conjunction with the XPathEntityProcessor. I have successfully imported XML content, but I think I may have found a limitation when it comes to the commonField attribute in the DataImportHandler.

 

Before writing my own parser to read in a whole XML document, I thought I'd post the question here (since I got some great advice last time).

 

The bulk of my content is contained within each <item> tag. However, each item has a parent called <category>, and each category has a name which I would like to import. In my forEach loop I specify /document/category/item as the collection of items I am interested in. Is there any way to extract an element from underneath a parent node? To be more specific (see the example XML below), I would like to index the following:

- category: Category 1; id: 1; author: Author 1

- category: Category 1; id: 2; author: Author 2

- category: Category 2; id: 3; author: Author 3

- category: Category 2; id: 4; author: Author 4

 

Any ideas on how I can get to a parent node from within a child during data import? If it can't be done, what would you suggest as the best way for me to keep using the DataImportHandler? Would XSLT be a good idea to 'flatten out' the structure a bit?

 

Thanks

 

This is what my XML document looks like:

<document>
 <category>
  <name>Category 1</name>
  <item>
   <id>1</id>
   <author>Author 1</author>
  </item>
  <item>
   <id>2</id>
   <author>Author 2</author>
  </item>
 </category>
 <category>
  <name>Category 2</name>
  <item>
   <id>3</id>
   <author>Author 3</author>
  </item>
  <item>
   <id>4</id>
   <author>Author 4</author>
  </item>
 </category>
</document>

 

And this is what my dataConfig looks like:
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
   <entity name="archive" pk="id" url="http://localhost:9080/data/20090817070752.xml" processor="XPathEntityProcessor" forEach="/document/category/item" transformer="DateFormatTransformer" stream="true" dataSource="dataSource">
    <field column="category" xpath="/document/category/name" commonField="true" />
    <field column="id" xpath="/document/category/item/id" />
    <field column="author" xpath="/document/category/item/author" />
   </entity>
  </document>
</dataConfig>

 

This is how I have specified my schema:
<fields>
   <field name="id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="author" type="string" indexed="true" stored="true"/>
   <field name="category" type="string" indexed="true" stored="true"/>
</fields>

<uniqueKey>id</uniqueKey>
<defaultSearchField>id</defaultSearchField>
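
On the XSLT idea raised above, here is a minimal sketch of a stylesheet that would flatten the sample document so that each <item> carries its own category name. The output layout (a <category> element placed inside each <item>) is an assumption chosen for illustration and does not come from the thread. The DataImportHandler documentation describes an xsl attribute on XPathEntityProcessor entities for applying a stylesheet before the XPaths are evaluated; how that interacts with stream="true" in a given build has not been checked here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Rebuild the root element with one flat <item> per original item -->
  <xsl:template match="/document">
    <document>
      <xsl:for-each select="category/item">
        <item>
          <!-- Copy the name of the enclosing <category> into the item -->
          <category><xsl:value-of select="../name"/></category>
          <id><xsl:value-of select="id"/></id>
          <author><xsl:value-of select="author"/></author>
        </item>
      </xsl:for-each>
    </document>
  </xsl:template>
</xsl:stylesheet>
```

With output shaped like this, the entity could presumably use forEach="/document/item" and read the category from /document/item/category, with no need for commonField.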

 


 


Re: Extract info from parent node during data import

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
In my tests both seem to be working. I had misspelt the column as
"catgoryname"; is that why?

Keep in mind that you also get extra docs for each "category".
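
For reference, this is roughly what the entity looks like with the two-path forEach that Venn reports as working in the message quoted below. It is a sketch assembled from the thread rather than a tested configuration. Only the forEach attribute differs from the original post, and the column keeps the name "category" so that it matches the posted schema.

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
   <!-- forEach lists two XPaths separated by "|"; everything else is as in the original post -->
   <entity name="archive" pk="id" url="http://localhost:9080/data/20090817070752.xml" processor="XPathEntityProcessor" forEach="/document/category/item | /document/category" transformer="DateFormatTransformer" stream="true" dataSource="dataSource">
    <!-- commonField="true" carries the category name over to the item rows that follow it -->
    <field column="category" xpath="/document/category/name" commonField="true" />
    <field column="id" xpath="/document/category/item/id" />
    <field column="author" xpath="/document/category/item/author" />
   </entity>
  </document>
</dataConfig>
```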



On Thu, Sep 10, 2009 at 5:53 PM, venn hardy <ve...@hotmail.com> wrote:
>
> Hi Paul,
> The forEach="/document/category/item | /document/category/name" didn't work (no categoryname was stored or indexed).
> However forEach="/document/category/item | /document/category" seems to work well. I am not sure why category on its own works, but not category/name...
> But thanks for the tip. It wasn't as painful as I thought it would be.
> Venn
>
>> From: noble.paul@corp.aol.com
>> Date: Thu, 10 Sep 2009 09:58:21 +0530
>> Subject: Re: Extract info from parent node during data import
>> To: solr-user@lucene.apache.org
>>
>> try this
>>
>> add two xpaths in your forEach
>>
>> forEach="/document/category/item | /document/category/name"
>>
>> and add a field as follows
>>
>> <field column="catgoryname" xpath ="/document/category/name"
>> commonField="true"/>
>>
>> Please try it out and let me know.



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Extract info from parent node during data import

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Sat, Sep 12, 2009 at 12:24 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
>>On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <ve...@hotmail.com> wrote:
>>>
>>> Hi Fergus,
>>>
>>> When I debugged in the development console http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>>>
>>> I had no problems. Each category/item seems to be only indexed once, and no parent fields are available (except the category name).
>>>
>>> I am not entirely sure how the forEach statement works, but my interpretation of forEach="/document/category/item | /document/category" is something like this:
>>>
>>> 1. Whenever DIH encounters a document/category it will extract the /document/category/name field as a common field.
>>> 2. Whenever DIH encounters a document/category/item it will extract all of the item fields.
>>> 3. When all fields have been encountered, save the document in Solr and go to the next category/item.
>>
>>/document/category/item | /document/category
>>
>>means there are two paths which trigger a new doc (it is possible to
>>have more). Whenever it encounters the closing tag of one of those XPaths, it
>>emits all the fields it collected since the opening of the same tag.
>>After that, it clears all the fields it collected since the opening of
>>the tag.
>>
>>Any fields it collected before the opening of that tag are retained.
>
>
> Nice and clear, but that is not what I see.
>
> With my test case, which uses forEach="/record | /record/mediaBlock",
> I see that each /record/mediaBlock "document" indexed contains all fields
> from the parent "/record" document as well. A search over mediaBlocks returns lots
> of extra fields from the parent which did not have the commonField attribute. I
> will try to produce a test case.

Yes, it does. /record/mediaBlock will have all the fields collected
from /record as well. It is by design.

Re: Extract info from parent node during data import

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <ve...@hotmail.com> wrote:
>>
>> Hi Fergus,
>>
>> When I debugged in the development console http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>>
>> I had no problems. Each category/item seems to be only indexed once, and no parent fields are available (except the category name).
>>
>> I am not entirely sure how the forEach statement works, but my interpretation of forEach="/document/category/item | /document/category" is something like this:
>>
>> 1. Whenever DIH encounters a document/category it will extract the /document/category/name field as a common field.
>> 2. Whenever DIH encounters a document/category/item it will extract all of the item fields.
>> 3. When all fields have been encountered, save the document in Solr and go to the next category/item.
>
>/document/category/item | /document/category
>
>means there are two paths which trigger a new doc (it is possible to
>have more). Whenever it encounters the closing tag of one of those XPaths, it
>emits all the fields it collected since the opening of the same tag.
>After that, it clears all the fields it collected since the opening of
>the tag.
>
>Any fields it collected before the opening of that tag are retained.


Nice and clear, but that is not what I see.

With my test case, which uses forEach="/record | /record/mediaBlock",
I see that each /record/mediaBlock "document" indexed contains all fields
from the parent "/record" document as well. A search over mediaBlocks returns lots
of extra fields from the parent which did not have the commonField attribute. I
will try to produce a test case.



-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Extract info from parent node during data import

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Fri, Sep 11, 2009 at 6:48 AM, venn hardy <ve...@hotmail.com> wrote:
>
> Hi Fergus,
>
> When I debugged in the development console http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport
>
> I had no problems. Each category/item seems to be only indexed once, and no parent fields are available (except the category name).
>
> I am not entirely sure how the forEach statement works, but my interpretation of forEach="/document/category/item | /document/category" is something like this:
>
> 1. Whenever DIH encounters a document/category it will extract the /document/category/name field as a common field.
> 2. Whenever DIH encounters a document/category/item it will extract all of the item fields.
> 3. When all fields have been encountered, save the document in Solr and go to the next category/item.

/document/category/item | /document/category

means there are two paths which trigger a new doc (it is possible to
have more). Whenever it encounters the closing tag of one of those XPaths, it
emits all the fields it collected since the opening of the same tag.
After that, it clears all the fields it collected since the opening of
the tag.

Any fields it collected before the opening of that tag are retained.
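
One way to read the explanation above is to annotate the first category of the sample document from the original post. The comments below are an interpretation of that description for forEach="/document/category/item | /document/category", not output captured from DIH.

```xml
<!-- Sample data from the original post, annotated for
     forEach="/document/category/item | /document/category" -->
<document>
 <category>                      <!-- opening tag of /document/category -->
  <name>Category 1</name>        <!-- collected before any <item> opens, so it is retained -->
  <item>                         <!-- opening tag of /document/category/item -->
   <id>1</id>
   <author>Author 1</author>
  </item>                        <!-- closing tag: a row is emitted, then id and author are cleared -->
  <item>
   <id>2</id>
   <author>Author 2</author>
  </item>                        <!-- closing tag: a second row is emitted; name is still retained -->
 </category>                     <!-- closing tag: one more row is emitted for the category itself -->
</document>
```

Under this reading, the row emitted at </category> would be the extra doc per category mentioned elsewhere in the thread.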




RE: Extract info from parent node during data import

Posted by venn hardy <ve...@hotmail.com>.
Hi Fergus,

When I debugged in the development console http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport

I had no problems. Each category/item seems to be only indexed once, and no parent fields are available (except the category name).

I am not entirely sure how the forEach statement works, but my interpretation of forEach="/document/category/item | /document/category" is something like this:

1. Whenever DIH encounters a document/category it will extract the /document/category/name field as a common field.
2. Whenever DIH encounters a document/category/item it will extract all of the item fields.
3. When all fields have been encountered, save the document in Solr and go to the next category/item.

 
> Date: Thu, 10 Sep 2009 14:19:31 +0100
> To: solr-user@lucene.apache.org
> From: fergus@twig.me.uk
> Subject: RE: Extract info from parent node during data import
> 
> >Hi Paul,
> >The forEach="/document/category/item | /document/category/name" didn't work (no categoryname was stored or indexed).
> >However forEach="/document/category/item | /document/category" seems to work well. I am not sure why category on its own works, but not category/name...
> >But thanks for the tip. It wasn't as painful as I thought it would be.
> >Venn
> 
> Hmmm, I had bother with this. Although each occurrence of /document/category/item
> causes a new Solr document to be indexed, that document contained all the fields from
> the parent element as well.
> 
> Did you see this?
> >> > This is how I have specified my schema
> >> > <fields>
> >> > <field name="id" type="string" indexed="true" stored="true" required="true" />
> >> > <field name="author" type="string" indexed="true" stored="true"/>
> >> > <field name="category" type="string" indexed="true" stored="true"/>
> >> > </fields>
> >> >
> >> > <uniqueKey>id</uniqueKey>
> >> > <defaultSearchField>id</defaultSearchField>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > _________________________________________________________________
> >> > Need a place to rent, buy or share? Let us find your next place for you!
> >> > http://clk.atdmt.com/NMN/go/157631292/direct/01/
> >> 
> >> 
> >> 
> >> -- 
> >> -----------------------------------------------------
> >> Noble Paul | Principal Engineer| AOL | http://aol.com
> >
> 
> -- 
> 
> ===============================================================
> Fergus McMenemie Email:fergus@twig.me.uk
> Techmore Ltd Phone:(UK) 07721 376021
> 
> Unix/Mac/Intranets Analyst Programmer
> ===============================================================


RE: Extract info from parent node during data import

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>Hi Paul,
>The forEach="/document/category/item | /document/category/name" didn't work (no categoryname was stored or indexed).
>However forEach="/document/category/item | /document/category" seems to work well. I am not sure why category on its own works, but not category/name...
>But thanks for the tip. It wasn't as painful as I thought it would be.
>Venn

Hmmm, I had bother with this. Although each occurrence of /document/category/item
causes a new Solr document to be indexed, that document contained all the fields from
the parent element as well.

Did you see this?

>
>> From: noble.paul@corp.aol.com
>> Date: Thu, 10 Sep 2009 09:58:21 +0530
>> Subject: Re: Extract info from parent node during data import
>> To: solr-user@lucene.apache.org
>> 
>> try this
>> 
>> add two xpaths in your forEach
>> 
>> forEach="/document/category/item | /document/category/name"
>> 
>> and add a field as follows
>> 
>> <field column="categoryname" xpath="/document/category/name"
>> commonField="true"/>
>> 
>> Please try it out and let me know.
>> 
>> 
>> 
>> 
>> -- 
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

RE: Extract info from parent node during data import

Posted by venn hardy <ve...@hotmail.com>.
Hi Paul,
The forEach="/document/category/item | /document/category/name" didn't work (no categoryname was stored or indexed).
However forEach="/document/category/item | /document/category" seems to work well. I am not sure why category on its own works, but not category/name...
But thanks for the tip. It wasn't as painful as I thought it would be.
Venn
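
For anyone reading this later, here is a rough sketch of what the entity might look like with the forEach that worked. It is the dataConfig from the original post with only the forEach attribute changed. The column name "category" is an assumption taken from the schema in the original post (the suggested field was named differently), and this exact fragment has not been tested.

```xml
<dataConfig>
 <dataSource type="URLDataSource" />
 <document>
  <!-- forEach lists two paths: the item (one Solr document each) and its
       parent category, so the parser also visits the category level -->
  <entity name="archive" pk="id"
          url="http://localhost:9080/data/20090817070752.xml"
          processor="XPathEntityProcessor"
          forEach="/document/category/item | /document/category"
          transformer="DateFormatTransformer" stream="true">
   <!-- commonField="true" asks DIH to carry this value over to the
        records that follow; "category" is an assumed column name that
        matches the schema field -->
   <field column="category" xpath="/document/category/name" commonField="true" />
   <field column="id" xpath="/document/category/item/id" />
   <field column="author" xpath="/document/category/item/author" />
  </entity>
 </document>
</dataConfig>
```

Because /document/category is itself listed in forEach, it may also produce a record for each category element, so it is worth checking the import output for rows that lack an id.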

> From: noble.paul@corp.aol.com
> Date: Thu, 10 Sep 2009 09:58:21 +0530
> Subject: Re: Extract info from parent node during data import
> To: solr-user@lucene.apache.org
> 
> try this
> 
> add two xpaths in your forEach
> 
> forEach="/document/category/item | /document/category/name"
> 
> and add a field as follows
> 
> <field column="categoryname" xpath="/document/category/name"
> commonField="true"/>
> 
> Please try it out and let me know.
> 
> 
> 
> 
> -- 
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Extract info from parent node during data import

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
try this

add two xpaths in your forEach

forEach="/document/category/item | /document/category/name"

and add a field as follows

<field column="categoryname" xpath="/document/category/name"
commonField="true"/>

Please try it out and let me know.




-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com