Posted to solr-user@lucene.apache.org by "Paul, Noble" <no...@corp.aol.com> on 2009/09/16 14:16:25 UTC
Re: Extract info from parent node during data import (redirect:)
Fergus,
Implementing wildcard (//tagname) is definitely possible, and I would
love to see it working. If you wish to take a dig at it, I shall do
whatever I can to help.
>What is the use case that makes flow through so useful?
We do not know which forEach xpath a given field is associated with.
Currently you can clean up the fields using a transformer. There is an
implicit field '$forEach' which tells you about the xpath tag for each
record that is emitted.
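
The transformer cleanup mentioned above might look roughly like the
sketch below. It assumes that '$forEach' holds the forEach xpath string
that emitted the row. The class name ChildRowCleaner and the field names
title and body are hypothetical and not taken from this thread. The
sketch also relies on DIH accepting a plain class that exposes a
transformRow(Map) method, which should be checked against your Solr
version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ChildRowCleaner {
    // Hypothetical: names of fields that should stay on the parent /record only.
    private static final List<String> PARENT_ONLY = Arrays.asList("title", "body");

    // Row hook in the DIH transformer style: mutate the row and return it.
    public Object transformRow(Map<String, Object> row) {
        // Assumption: $forEach carries the xpath that triggered this row.
        if ("/record/mediaBlock".equals(row.get("$forEach"))) {
            for (String name : PARENT_ONLY) {
                row.remove(name); // drop fields that flowed through from the parent
            }
        }
        return row;
    }
}
```

Rows emitted for /record pass through unchanged; only rows emitted for
/record/mediaBlock lose the listed parent fields.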
>The recently added comments in XPathRecordReader are a great help and I
>was planning to add more. Might this be an issue?
I would love to have it. Give a patch and I shall commit it.
XPathRecordReader is a blackbox and AFAIK I am the only one who knows
it. I would love to have more eyes on that.
>I would like to open a JIRA for improving XPathRecordReader.
Please go ahead. You can paste the contents of this mail in the list.
There may be others with similar ideas.
Noble.
-----Original Message-----
>>>Noble
>>>>
>>>>/document/category/item | /document/category
>>>>
>>>>means there are two paths which trigger a new doc (it is possible to
>>>>have more). Whenever it encounters the closing tag of that xpath, it
>>>>emits all the fields it collected since the opening of the same tag.
>>>>After that it clears all the fields it collected since the opening of
>>>>the tag.
>>>>
>>>>If there are fields it collected before the opening of the same tag,
>>>>it retains them.
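
The emit-and-clear rule described above, together with the retention of
earlier fields, can be modelled with a small stand-alone sketch. This is
not the XPathRecordReader code; the class FlowThroughSketch and its
open/field/close methods are hypothetical names used only to illustrate
the behaviour, under the assumption that an emitted row contains every
field still held at the moment the forEach tag closes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlowThroughSketch {
    // Fields collected so far, as {name, value} pairs in arrival order.
    private final List<String[]> collected = new ArrayList<>();
    // Size of 'collected' at the moment each forEach tag was opened.
    private final Deque<Integer> marks = new ArrayDeque<>();
    private final List<Map<String, String>> emitted = new ArrayList<>();

    // Called when a tag matching a forEach xpath opens.
    public void open() {
        marks.push(collected.size());
    }

    // Called when a field value is read.
    public void field(String name, String value) {
        collected.add(new String[] {name, value});
    }

    // Called when the matching forEach tag closes.
    public void close() {
        Map<String, String> row = new LinkedHashMap<>();
        for (String[] f : collected) {
            row.put(f[0], f[1]); // includes fields retained from enclosing tags
        }
        emitted.add(row);
        int mark = marks.pop();
        // Clear only what was collected since this tag opened.
        collected.subList(mark, collected.size()).clear();
    }

    public List<Map<String, String>> emitted() {
        return emitted;
    }
}
```

Opening /record, reading a parent field, then opening and closing
/record/mediaBlock yields a child row that also carries the parent
field, because that field was collected before the child tag opened.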
>>>
>>>
>>> Nice and clear, but that is not what I see.
>>>
>>> With my test case with forEach="/record | /record/mediaBlock"
>>> I see that for each /record/mediaBlock "document" indexed it contains
>>> all fields from the parent "/record" document as well. A search over
>>> mediaBlocks returns lots of extra fields from the parent which did
>>> not have the commonField attribute. I will try and produce a test case.
>>
>>Yes, it does. /record/mediaBlock will have all the fields collected
>>from /record as well. *****It is by design******
>
>Oh!
>
>I had always considered it a bug or at least a limitation. After all, if
>we have the "commonField" attribute, why do we need an automatic flow
>through of all collected fields from parent nodes? This feature is, as
>far as I can see, undocumented and at the same time unintuitive.
>It also, in my case, causes tons more information to be indexed than is
>needed.
>
>I have spent a while thinking through possible use cases. My use case
>involves having documents we want to search as a whole and behave as
>normal. At the same time these documents contain inner sections we wish
>to treat as sub-documents; in my case I have pictures with associated
>captions which I wish to search separately. Having indexed the documents
>with forEach="/record | /record/mediaBlock" my picture search works
>nicely but I have a nasty side effect when performing searches over the
>rest of the document. Because fields from the parent node are also
>present in the children, when I search for any text the same document
>gets returned many times, once due to the text in the parent node and
>again for each picture placed in the document. I have a workaround for
>this issue but have always considered it a bug.
>
>What is the use case that makes flow through so useful?
>
>I had just started playing with the code to see how easy this would be
>to change. The recently added comments in XPathRecordReader are a great
>help and I was planning to add more. Might this be an issue?
>
>I have noted, while lurking on the solr mail lists, that requests for
>this type of functionality keep coming up: to be able to restrict
>searches to a sub section of a document. I have really needed this sort
>of thing many times with the type of stuff I work with.
>
>My other planned activity was to see how easy xpaths such as //tagname
>would be to implement, since my latest data-config.xml looks like:-
>
><field column="para32" name="text" xpath="/record/address/para"
>flatten="true" />
><field column="para40" name="text" xpath="/record/authoredBy/para"
>flatten="true" />
><field column="para43" name="text"
>xpath="/record/dataGroup/address/para" flatten="true" />
><field column="para47" name="text"
>xpath="/record/dataGroup/keyPersonnel/doubleList/first/para"
>flatten="true" />
><field column="para49" name="text"
>xpath="/record/dataGroup/keyPersonnel/doubleList/second/para"
>flatten="true" />
><field column="para50" name="text"
>xpath="/record/dataGroup/keyPersonnel/para" flatten="true" />
><field column="para51" name="text" xpath="/record/dataGroup/para"
>flatten="true" />
><field column="para57" name="text"
>xpath="/record/doubleList/first/para" flatten="true" />
><field column="para59" name="text"
>xpath="/record/doubleList/second/para" flatten="true" />
><field column="para63" name="text"
>xpath="/record/keyPersonnel/doubleList/first/para" flatten="true" />
><field column="para65" name="text"
>xpath="/record/keyPersonnel/doubleList/second/para" flatten="true" />
><field column="para68" name="text" xpath="/record/list/listItem/para"
>flatten="true" />
><field column="para75" name="text"
>xpath="/record/mediaBlock/doubleList/first/para" flatten="true" />
><field column="para77" name="text"
>xpath="/record/mediaBlock/doubleList/second/para" flatten="true" />
><field column="para172" name="text"
>xpath="/record/noteGroup/note/para" flatten="true" />
><field column="para174" name="text"
>xpath="/record/para" flatten="true" />
><field column="para179" name="text"
>xpath="/record/relatedInfo/list/listItem/relatedArticle/para" flatten="true" />
><field column="para184" name="text"
>xpath="/record/sect1/address/dataGroup/para" flatten="true" />
><field column="para185" name="text"
>xpath="/record/sect1/address/para" flatten="true" />
><field column="para195" name="text"
>xpath="/record/sect1/dataGroup/address/para" flatten="true" />
><field column="para199" name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/first/para" flatten="true" />
><field column="para201" name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/second/para" flatten="true" />
><field column="para202" name="text"
>xpath="/record/sect1/dataGroup/keyPersonnel/para" flatten="true" />
><field column="para203" name="text"
>xpath="/record/sect1/dataGroup/para" flatten="true" />
><field column="para208" name="text"
>xpath="/record/sect1/doubleList/first/para" flatten="true" />
><field column="para212" name="text"
>xpath="/record/sect1/doubleList/second/list/listItem/para" flatten="true" />
><field column="para213" name="text"
>xpath="/record/sect1/doubleList/second/para" flatten="true" />
><field column="para217" name="text"
>xpath="/record/sect1/keyPersonnel/doubleList/first/para" flatten="true" />
><field column="para219" name="text"
>xpath="/record/sect1/keyPersonnel/doubleList/second/para" flatten="true" />
><field column="para220" name="text"
>xpath="/record/sect1/keyPersonnel/para" flatten="true" />
><field column="para225" name="text"
>xpath="/record/sect1/list/listItem/list/listItem/para" flatten="true" />
><field column="para226" name="text"
>xpath="/record/sect1/list/listItem/para" flatten="true" />
><field column="para240" name="text"
>xpath="/record/sect1/para" flatten="true" />
><field column="para244" name="text"
>xpath="/record/sect1/sect2/doubleList/first/para" flatten="true" />
><field column="para246" name="text"
>xpath="/record/sect1/sect2/doubleList/second/para" flatten="true" />
><field column="para251" name="text"
>xpath="/record/sect1/sect2/list/listItem/list/listItem/para" flatten="true" />
><field column="para252" name="text"
>xpath="/record/sect1/sect2/list/listItem/para" flatten="true" />
><field column="para258" name="text"
>xpath="/record/sect1/sect2/noteGroup/note/para" flatten="true" />
><field column="para259" name="text"
>xpath="/record/sect1/sect2/para" flatten="true" />
><field column="para265" name="text"
>xpath="/record/sect1/sect2/sect3/list/listItem/list/listItem/para" flatten="true" />
><field column="para266" name="text"
>xpath="/record/sect1/sect2/sect3/list/listItem/para" flatten="true" />
><field column="para271" name="text"
>xpath="/record/sect1/sect2/sect3/para" flatten="true" />
><field column="para275" name="text"
>xpath="/record/sect1/sect2/sect3/sect4/list/listItem/para" flatten="true" />
><field column="para279" name="text"
>xpath="/record/sect1/sect2/sect3/sect4/para" flatten="true" />
><field column="para284" name="text"
>xpath="/record/sect1/sect2/sect3/sect4/sect5/para" flatten="true" />
><field column="para295" name="text"
>xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten="true" />
><field column="para297" name="text"
>xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para" flatten="true" />
><field column="para301" name="text"
>xpath="/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para" flatten="true" />
><field column="para312" name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para" flatten="true" />
><field column="para315" name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten="true" />
><field column="para316" name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para" flatten="true" />
><field column="para318" name="text"
>xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/para" flatten="true" />
><field column="para322" name="text"
>xpath="/record/sect1/sect2/table/tgroup/thead/row/entry/para" flatten="true" />
><field column="para341" name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten="true" />
><field column="para342" name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para" flatten="true" />
><field column="para344" name="text"
>xpath="/record/sect1/table/tgroup/tbody/row/entry/para" flatten="true" />
><field column="para348" name="text"
>xpath="/record/sect1/table/tgroup/thead/row/entry/para" flatten="true" />
><field column="para371" name="text"
>xpath="/record/table/tgroup/tbody/row/entry/noteGroup/note/para" flatten="true" />
><field column="para373" name="text"
>xpath="/record/table/tgroup/tbody/row/entry/para" flatten="true" />
><field column="para377" name="text"
>xpath="/record/table/tgroup/thead/row/entry/para" flatten="true" />
>
>Which is nuts!
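
Every xpath in the list above ends in the same tag, which is what a
//tagname expression would capture. The matching rule can be sketched as
below. This is only an illustration of the idea, not how
XPathRecordReader works; the class TagNameMatch and its matches method
are hypothetical, and the sketch handles only a leading // followed by a
single tag name rather than general XPath.

```java
public class TagNameMatch {
    // Returns true when 'path' (an absolute element path such as
    // /record/sect1/para) is selected by 'expr'.
    public static boolean matches(String expr, String path) {
        if (expr.startsWith("//")) {
            // //tagname: match on the last path segment, at any depth.
            String tag = expr.substring(2);
            return path.endsWith("/" + tag);
        }
        // Anything else is treated as a literal absolute path.
        return expr.equals(path);
    }
}
```

Under this rule, a single //para expression selects /record/para,
/record/sect1/sect2/para and every other path listed above.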
>
>I would like to open a JIRA for improving XPathRecordReader.
>
>Regds Fergus.
Re: Extract info from parent node during data import (redirect:)
Posted by Fergus McMenemie <fe...@twig.me.uk>.
JIRA SOLR-1437 created
"DIH: Enhance XPathRecordReader to deal with //tagname and other improvements."
>Fergus,
>
>Implementing wildcard (//tagname) is definitely possible, and I would
>love to see it working. If you wish to take a dig at it, I shall do
>whatever I can to help.
>
>>What is the use case that makes flow through so useful?
>We do not know which forEach xpath a given field is associated with.
>Currently you can clean up the fields using a transformer. There is an
>implicit field '$forEach' which tells you about the xpath tag for each
>record that is emitted.
>
>>The recently added comments in XPathRecordReader are a great help and I
>>was planning to add more. Might this be an issue?
>I would love to have it. Give a patch and I shall commit it.
>XPathRecordReader is a blackbox and AFAIK I am the only one who knows
>it. I would love to have more eyes on that.
>
>>I would like to open a JIRA for improving XPathRecordReader.
>Please go ahead. You can paste the contents of this mail in the list.
>There may be others with similar ideas.
>
>Noble.
--
===============================================================
Fergus McMenemie Email:fergus@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================