Posted to solr-user@lucene.apache.org by "Paul, Noble" <no...@corp.aol.com> on 2009/09/16 14:16:25 UTC

Re: Extract info from parent node during data import (redirect:)

Fergus,

Implementing a wildcard (//tagname) is definitely possible, and I would
love to see it working. If you wish to take a stab at it, I shall do
whatever I can to help.
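
As a rough illustration of what such wildcard matching might involve,
here is a minimal Python sketch. It is not a proposal for the actual
XPathRecordReader change: the function name `matches` and the idea of
holding the current element path as a list of tag names are assumptions
made for the example, and only a leading `//` is handled.

```python
def matches(pattern, path):
    """Return True if `path` (list of open tag names) satisfies `pattern`.

    A pattern starting with '//' matches when its remaining steps equal
    the tail of the path; any other pattern must equal the whole path.
    """
    if pattern.startswith("//"):
        suffix = pattern[2:].split("/")
        return path[-len(suffix):] == suffix
    return pattern.strip("/").split("/") == path


# One '//para' pattern covers paths that otherwise each need their own entry.
deep = ["record", "sect1", "sect2", "para"]
shallow = ["record", "para"]
covered = [matches("//para", p) for p in (deep, shallow)]
```

Under this sketch a single `//para` pattern would stand in for every
absolute path that ends in `para`.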

>What is the use case that makes flow through so useful?
We do not know which forEach xpath a given field is associated with.
Currently you can clean up the fields using a transformer. There is an
implicit field '$forEach' which tells you about the xpath tag for each
record that is emitted.
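
A rough illustration of that clean-up, written in Python rather than as
DIH transformer code. It is only a sketch: the field names `title` and
`caption`, the `OWNED_FIELDS` table and the function name are made up
for the example, and it assumes that '$forEach' holds the forEach xpath
that triggered the record.

```python
# Hypothetical mapping: which columns belong to which forEach xpath.
OWNED_FIELDS = {
    "/record": {"title"},
    "/record/mediaBlock": {"caption"},
}


def drop_foreign_fields(record, owned=OWNED_FIELDS):
    """Keep only the columns owned by the xpath named in '$forEach'.

    Implicit '$'-prefixed fields are passed through untouched.
    """
    keep = owned.get(record.get("$forEach"), set())
    return {k: v for k, v in record.items() if k in keep or k.startswith("$")}


# Records as they might look once parent fields have flowed through.
emitted = [
    {"title": "Report", "caption": "Fig 1", "$forEach": "/record/mediaBlock"},
    {"title": "Report", "$forEach": "/record"},
]
cleaned = [drop_foreign_fields(r) for r in emitted]
```

The point is only that '$forEach' gives the transformer enough
information to decide which fields to keep for each emitted record.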

>The recently added comments in XPathRecordReader are a great help and I
>was planning to add more. Might this be an issue?
I would love to have them. Submit a patch and I shall commit it.
XPathRecordReader is a black box and AFAIK I am the only one who knows
it. I would love to have more eyes on it.

>I would like to open a JIRA for improving XPathRecordReader.
Please go ahead. You can paste the contents of this mail in the list.
There may be others with similar ideas.

Noble.
-----Original Message-----
>>>Noble
>>>>
>>>>/document/category/item | /document/category
>>>>
>>>>means there are two paths which trigger a new doc (it is possible to
>>>>have more). Whenever it encounters the closing tag of that xpath, it
>>>>emits all the fields it collected since the opening of the same tag.
>>>>After that it clears all the fields it collected since the opening of
>>>>the tag.
>>>>
>>>>If there are fields it collected before the opening of the same tag,
>>>>it retains them.
>>>
>>>
>>> Nice and clear, but that is not what I see.
>>>
>>> In my test case, with forEach="/record | /record/mediaBlock",
>>> I see that each /record/mediaBlock "document" indexed contains
>>> all fields from the parent "/record" document as well. A search over
>>> mediaBlocks returns lots of extra fields from the parent which did
>>> not have the commonField attribute. I will try to produce a test case.
>>
>>Yes, it does. /record/mediaBlock will have all the fields collected
>>from /record as well.  *****It is by design******
>
>Oh!
>
>I had always considered it a bug or at least a limitation. After all, if
>we have the "commonField" attribute, why do we need an automatic flow
>through of all collected fields from parent nodes? This feature is, as
>far as I can see, undocumented and at the same time unintuitive.
>It also, in my case, causes tons more information to be indexed than is
>needed.
>
>I have spent a while thinking through possible use cases. My use case
>involves having documents we want to search as a whole and behave as
>normal. At the same time these documents contain inner sections we wish
>to treat as sub-documents; in my case I have pictures with associated
>captions which I wish to search separately. Having indexed the documents
>with forEach="/record | /record/mediaBlock" my picture search works
>nicely but I have a nasty side effect when performing searches over the
>rest of the document. Because fields from the parent node are also
>present in the children, when I search for any text the same document
>gets returned many times, once due to the text in the parent node and
>again for each picture placed in the document. I have a workaround for
>this issue but have always considered it a bug.
>
>What is the use case that makes flow through so useful?
>
>I had just started playing with the code to see how easy this would be
>to change. The recently added comments in XPathRecordReader are a great
>help and I was planning to add more. Might this be an issue?
>
>I have noted, while lurking on the Solr mailing lists, that requests for
>this type of functionality keep coming up: to be able to restrict
>searches to a sub-section of a document. I have really needed this sort
>of thing many times with the type of stuff I work with.
>
>My other planned activity was to see how easy xpaths such as //tagname
>would be to implement, since my latest data-config.xml looks like this:
>
><field column="para32"   name="text" xpath="/record/address/para"
>flatten="true" />
><field column="para40"   name="text" xpath="/record/authoredBy/para"
>flatten="true" />
><field column="para43"   name="text"
>xpath="/record/dataGroup/address/para"  flatten="true" />
><field column="para47"   name="text"
>xpath="/record/dataGroup/keyPersonnel/doubleList/first/para"
>flatten="true" />
><field column="para49"   name="text"
>xpath="/record/dataGroup/keyPersonnel/doubleList/second/para"
>flatten="true" />
><field column="para50"   name="text"
>xpath="/record/dataGroup/keyPersonnel/para"  flatten="true" />
><field column="para51"   name="text" xpath="/record/dataGroup/para"
>flatten="true" />
><field column="para57"   name="text"
>xpath="/record/doubleList/first/para"  flatten="true" />
><field column="para59"   name="text"
>xpath="/record/doubleList/second/para"  flatten="true" />
><field column="para63"   name="text"
>xpath="/record/keyPersonnel/doubleList/first/para"  flatten="true" />
><field column="para65"   name="text"
>xpath="/record/keyPersonnel/doubleList/second/para"  flatten="true" />
><field column="para68"   name="text" xpath="/record/list/listItem/para"
>flatten="true" />
><field column="para75"   name="text"
>xpath="/record/mediaBlock/doubleList/first/para"  flatten="true" />
><field column="para77"   name="text"
>xpath="/record/mediaBlock/doubleList/second/para"  flatten="true" />
><field column="para172"  name="text" xpath="/record/noteGroup/note/para"
>flatten="true" />
><field column="para174"  name="text" xpath="/record/para"
>flatten="true" />
><field column="para179"  name="text"
>xpath="/record/relatedInfo/list/listItem/relatedArticle/para"
>flatten="true" />
><field column="para184"  name="text"
>xpath="/record/sect1/address/dataGroup/para"  flatten="true" />
><field column="para185"  name="text" xpath="/record/sect1/address/para"
>flatten="true" />
><field column="para195"  name="text"
>xpath="/record/sect1/dataGroup/address/para"  flatten="true" />
><field column="para199"  name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/first/para"
>flatten="true" />
><field column="para201"  name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/second/para"
>flatten="true" />
><field column="para202"  name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/para"  flatten="true" />
><field column="para203"  name="text"
>xpath="/record/sect1/dataGroup/para"  flatten="true" />
><field column="para208"  name="text"
>xpath="/record/sect1/doubleList/first/para"  flatten="true" />
><field column="para212"  name="text"
>xpath="/record/sect1/doubleList/second/list/listItem/para"
>flatten="true" />
><field column="para213"  name="text"
>xpath="/record/sect1/doubleList/second/para"  flatten="true" />
><field column="para217"  name="text"
>xpath="/record/sect1/keyPersonnel/doubleList/first/para"
>flatten="true" />
><field column="para219"  name="text"
>xpath="/record/sect1/keyPersonnel/doubleList/second/para"
>flatten="true" />
><field column="para220"  name="text"
>xpath="/record/sect1/keyPersonnel/para"  flatten="true" />
><field column="para225"  name="text"
>xpath="/record/sect1/list/listItem/list/listItem/para"
>flatten="true" />
><field column="para226"  name="text"
>xpath="/record/sect1/list/listItem/para"  flatten="true" />
><field column="para240"  name="text" xpath="/record/sect1/para"
>flatten="true" />
><field column="para244"  name="text"
>xpath="/record/sect1/sect2/doubleList/first/para"  flatten="true" />
><field column="para246"  name="text"
>xpath="/record/sect1/sect2/doubleList/second/para"  flatten="true" />
><field column="para251"  name="text"
>xpath="/record/sect1/sect2/list/listItem/list/listItem/para"
>flatten="true" />
><field column="para252"  name="text"
>xpath="/record/sect1/sect2/list/listItem/para"  flatten="true" />
><field column="para258"  name="text"
>xpath="/record/sect1/sect2/noteGroup/note/para"  flatten="true" />
><field column="para259"  name="text" xpath="/record/sect1/sect2/para"
>flatten="true" />
><field column="para265"  name="text"
>xpath="/record/sect1/sect2/sect3/list/listItem/list/listItem/para"
>flatten="true" />
><field column="para266"  name="text"
>xpath="/record/sect1/sect2/sect3/list/listItem/para"  flatten="true" />
><field column="para271"  name="text"
>xpath="/record/sect1/sect2/sect3/para"  flatten="true" />
><field column="para275"  name="text"
>xpath="/record/sect1/sect2/sect3/sect4/list/listItem/para"
>flatten="true" />
><field column="para279"  name="text"
>xpath="/record/sect1/sect2/sect3/sect4/para"  flatten="true" />
><field column="para284"  name="text"
>xpath="/record/sect1/sect2/sect3/sect4/sect5/para"  flatten="true" />
><field column="para295"  name="text"
>xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para"
>flatten="true" />
><field column="para297"  name="text"
>xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para"
>flatten="true" />
><field column="para301"  name="text"
>xpath="/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para"
>flatten="true" />
><field column="para312"  name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para"
>flatten="true" />
><field column="para315"  name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para"
>flatten="true" />
><field column="para316"  name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para"
>flatten="true" />
><field column="para318"  name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/para"
>flatten="true" />
><field column="para322"  name="text"
>xpath="/record/sect1/sect2/table/tgroup/thead/row/entry/para"
>flatten="true" />
><field column="para341"  name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para"
>flatten="true" />
><field column="para342"  name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para"
>flatten="true" />
><field column="para344"  name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/para"
>flatten="true" />
><field column="para348"  name="text"
>xpath="/record/sect1/table/tgroup/thead/row/entry/para"
>flatten="true" />
><field column="para371"  name="text"
>xpath="/record/table/tgroup/tbody/row/entry/noteGroup/note/para"
>flatten="true" />
><field column="para373"  name="text"
>xpath="/record/table/tgroup/tbody/row/entry/para"  flatten="true" />
><field column="para377"  name="text"
>xpath="/record/table/tgroup/thead/row/entry/para"  flatten="true" />
>
>Which is nuts!
>
>I would like to open a JIRA for improving XPathRecordReader.
>
>Regds Fergus.
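
To make the forEach behaviour discussed in the quoted exchange concrete,
here is a small Python simulation of it. It is a sketch of the semantics
as described in this thread, not the XPathRecordReader code:
`read_records`, the sample XML and the field names are invented for
illustration, and each column is assumed to be single-valued.

```python
import io
import xml.etree.ElementTree as ET


def read_records(xml_text, for_each, fields):
    """Emit one dict per closing tag whose absolute path is in `for_each`.

    `fields` maps an absolute element path to a column name. A record
    holds every field collected so far, including fields gathered
    before its own tag opened, plus '$forEach'.
    """
    path = []       # tag names of the currently open elements
    collected = []  # (column, value) pairs in document order
    frames = []     # len(collected) when each open forEach tag started
    records = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if "/" + "/".join(path) in for_each:
                frames.append(len(collected))
            continue
        xpath = "/" + "/".join(path)
        if xpath in fields:
            collected.append((fields[xpath], (elem.text or "").strip()))
        if xpath in for_each:
            record = dict(collected)
            record["$forEach"] = xpath
            records.append(record)
            # Clear only what was collected since this tag opened.
            del collected[frames.pop():]
        path.pop()
    return records


SAMPLE = (
    "<record><title>Report</title>"
    "<mediaBlock><caption>Fig 1</caption></mediaBlock>"
    "<mediaBlock><caption>Fig 2</caption></mediaBlock>"
    "</record>"
)
FOR_EACH = {"/record", "/record/mediaBlock"}
FIELDS = {"/record/title": "title", "/record/mediaBlock/caption": "caption"}
records = read_records(SAMPLE, FOR_EACH, FIELDS)
```

With this input each mediaBlock record carries the parent's title, which
is the flow-through under discussion, while the /record record has no
caption because the captions were cleared as each mediaBlock closed.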


Re: Extract info from parent node during data import (redirect:)

Posted by Fergus McMenemie <fe...@twig.me.uk>.
JIRA SOLR-1437 created 

  "DIH: Enhance XPathRecordReader to deal with //tagname and other improvements."


-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================