Posted to solr-user@lucene.apache.org by "Twomey, David" <da...@novartis.com> on 2012/04/30 22:46:17 UTC

correct XPATH syntax

Is this possible in DataImportHandler?

I want all of the following XML to collapse into a single Author field:

<AuthorList CompleteYN="Y">
 <Author ValidYN="Y">
  <LastName>Sørlie</LastName>
  <ForeName>T</ForeName>
  <Initials>T</Initials>
 </Author>
 <Author ValidYN="Y">
  <LastName>Perou</LastName>
  <ForeName>C M</ForeName>
  <Initials>CM</Initials>
 </Author>
 <Author ValidYN="Y">
  <LastName>Tibshirani</LastName>
  <ForeName>R</ForeName>
  <Initials>R</Initials>
 </Author>
...

So my XPATH is like 


Re: correct XPATH syntax

Posted by Lance Norskog <go...@gmail.com>.
The XPath implementation in DIH is very minimal: it is tuned for
speed, not features. The XSL option lets you do everything you could
want, at the cost of a slower engine.
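
As a rough illustration of the XSL route, here is a minimal XSLT 1.0 sketch. It was not posted in this thread. It copies the input unchanged, except that each Author element is rewritten to hold only "LastName Initials". It assumes the Medline element names quoted elsewhere in this thread and no XML namespaces; an author entry without a LastName would come out as initials only, or empty.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute as-is by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace the children of each Author with the text "LastName Initials".
       normalize-space() trims the result when Initials (or LastName) is missing. -->
  <xsl:template match="Author">
    <Author>
      <xsl:value-of select="normalize-space(concat(LastName, ' ', Initials))"/>
    </Author>
  </xsl:template>

</xsl:stylesheet>
```

A stylesheet like this would be referenced from the entity's 'xsl' option, so that an xpath ending in /AuthorList/Author reads the rewritten text rather than the three child elements.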

On Thu, May 3, 2012 at 7:30 AM, lboutros <bo...@gmail.com> wrote:
> OK, not that easy :)
>
> I have not tested it myself, but it seems that you could use XSL
> preprocessing with the 'xsl' option of your XPathEntityProcessor:
>
> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
>
> You could transform the author part as you wish and then import the author
> field with your current configuration.
>
> Ludovic.
>
> -----
> Jouve
> France.
> --
> View this message in context: http://lucene.472066.n3.nabble.com/correct-XPATH-syntax-tp3951804p3959397.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goksron@gmail.com

Re: correct XPATH syntax

Posted by lboutros <bo...@gmail.com>.
OK, not that easy :)

I have not tested it myself, but it seems that you could use XSL
preprocessing with the 'xsl' option of your XPathEntityProcessor:

http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

You could transform the author part as you wish and then import the author
field with your current configuration.

Ludovic.

-----
Jouve
France.
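
For illustration only, the 'xsl' option might be wired into the entity roughly as below. This is an untested sketch: the stylesheet path is hypothetical, and it assumes a stylesheet that has already rewritten each Author element so that it contains only the text to be indexed (for example "LastName Initials"). The entity name, url and xpath values are the ones David quotes elsewhere in this thread; the other fields and attributes of his entity are left out for brevity.

```xml
<!-- Fragment of data-config.xml; it sits inside the existing parent entity
     that supplies ${medlineFileList.fileAbsolutePath}. -->
<entity name="medlineFiles" processor="XPathEntityProcessor"
        url="${medlineFileList.fileAbsolutePath}"
        forEach="/MedlineCitationSet/MedlineCitation"
        xsl="/path/to/medline-authors.xsl"> <!-- hypothetical stylesheet location -->

  <field column="pmid"
         xpath="/MedlineCitationSet/MedlineCitation/PMID" />

  <!-- After the XSL step each Author holds plain text, so no flatten is needed. -->
  <field column="author"
         xpath="/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author" />
</entity>
```

Because the stylesheet runs before the XPath step, the forEach and xpath expressions have to match the transformed document, not the original file.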

Re: correct XPATH syntax

Posted by "Twomey, David" <da...@novartis.com>.
Is what I want even possible with XPathEntityProcessor?

It sort of works now. I didn't realize that "flatten" is an attribute of the field rather than of the entity.

BUT it's still not what I would like.

The XML looks like the following, and it is nested within /MedlineCitationSet/MedlineCitation/Article/

<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Starremans</LastName>
<ForeName>Patrick G J F</ForeName>
<Initials>PG</Initials>
</Author>
<Author ValidYN="Y">
<LastName>van der Kemp</LastName>
<ForeName>Annemiete W C M</ForeName>
<Initials>AW</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Knoers</LastName>
<ForeName>Nine V A M</ForeName>
<Initials>NV</Initials>
</Author>
<Author ValidYN="Y">
<LastName>van den Heuvel</LastName>
<ForeName>Lambertus P W J</ForeName>
<Initials>LP</Initials>
</Author>
</AuthorList>

What I would like to see in the index's author field is
<author>Starremans PG, Van der Kemp AW, etc.</author>
Note: "LastName Initials", no ForeName.


When I set the XPath like this
<field column="author" xpath="/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author" flatten="true" />

I get this in the index
<arr name="author">
<str>Starremans Patrick G J F PG</str>
<str>Van der Kemp Annemiete W C M AW</str>
.
.
</arr>
note: the forename field is included

My author field in the schema.xml is
<field name="author" type="textgen" indexed="true" stored="true" multiValued="true" required="false"/>

So is this even possible with XPathEntityProcessor?

Thanks
David




On 5/3/12 8:40 AM, "lboutros" <bo...@gmail.com> wrote:

Hi David,

What do you want to do with the 'commonField' option?

Could you share the part of the schema for the author field, please?
Is the author field stored?

Ludovic.

-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/correct-XPATH-syntax-tp3951804p3959097.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: correct XPATH syntax

Posted by lboutros <bo...@gmail.com>.
Hi David,

What do you want to do with the 'commonField' option?

Could you share the part of the schema for the author field, please?
Is the author field stored?

Ludovic.

-----
Jouve
France.

Re: correct XPATH syntax

Posted by "Twomey, David" <da...@novartis.com>.
Ludovic,

Thanks for your help.  I tried your suggestion but it didn't work for
Authors.  Below are three snippets: the data-config.xml, the XML file, and
the XML response from Solr.

Data-config:
             <entity name="medlineFiles" processor="XPathEntityProcessor"
                url="${medlineFileList.fileAbsolutePath}"
                forEach="/MedlineCitationSet/MedlineCitation"
                transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,LogTransformer"
                logTemplate="   processing ${medlineFileList.fileAbsolutePath}"
                logLevel="info"
                flatten="true"
                stream="true">

                <field column="pmid"
                       xpath="/MedlineCitationSet/MedlineCitation/PMID"
                       commonField="true" />
                <field column="journal_name"
                       xpath="/MedlineCitationSet/MedlineCitation/Article/Journal/Title"
                       commonField="true" />
                <field column="title"
                       xpath="/MedlineCitationSet/MedlineCitation/Article/ArticleTitle"
                       commonField="true" />
                <field column="abstract"
                       xpath="/MedlineCitationSet/MedlineCitation/Article/Abstract/AbstractText"
                       commonField="true" />
                <field column="author"
                       xpath="/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author"
                       commonField="false" />
                <field column="year"
                       xpath="/MedlineCitationSet/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"
                       commonField="true" />

              </entity>



XML Snippet for Author:
<AuthorList CompleteYN="Y">
 <Author ValidYN="Y">
  <LastName>Malathi</LastName>
  <ForeName>K</ForeName>
  <Initials>K</Initials>
 </Author>
 <Author ValidYN="Y">
  <LastName>Xiao</LastName>
  <ForeName>Y</ForeName>
  <Initials>Y</Initials>
 </Author>
 <Author ValidYN="Y">
  <LastName>Mitchell</LastName>
  <ForeName>A P</ForeName>
  <Initials>AP</Initials>
 </Author>
</AuthorList>


Response from SOLR:

<arr name="author">
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
</arr>
<str name="journal_name">Journal of cancer research and clinical oncology</str>




Thanks
David

On 5/1/12 8:05 AM, "lboutros" <bo...@gmail.com> wrote:

>Hi David,
>
>I think you should add this option: flatten=true
>
>Then could you try this XPath:
>
>/MedlineCitationSet/MedlineCitation/AuthorList/Author
>
>See here for the description:
>
>http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
>
>I don't think the commonField option is needed here; I think you
>should remove it.
>
>Ludovic. 
>
>-----
>Jouve
>France.
>--
>View this message in context: http://lucene.472066.n3.nabble.com/correct-XPATH-syntax-tp3951804p3952812.html
>Sent from the Solr - User mailing list archive at Nabble.com.


Re: correct XPATH syntax

Posted by lboutros <bo...@gmail.com>.
Hi David,

I think you should add this option: flatten=true

Then could you try this XPath:

/MedlineCitationSet/MedlineCitation/AuthorList/Author

See here for the description:

http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

I don't think the commonField option is needed here; I think you
should remove it.

Ludovic. 

-----
Jouve
France.

Re: correct XPATH syntax

Posted by "Twomey, David" <da...@novartis.com>.
Answering my own question: I think I can do this by writing a script that
concatenates the LastName, ForeName and Initials, and applying it to
xpath = /AuthorList/Author

Yes?
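
One way such a script might look, using DIH's ScriptTransformer, is sketched below. It is untested and rests on several assumptions: the helper columns author_last and author_initials are hypothetical names, both are assumed to be declared multiValued in schema.xml so that DIH collects every match as a list, and every Author is assumed to have both a LastName and an Initials element. If any author lacks one of them, the two lists no longer line up, so the script leaves the author column unset rather than pair names with the wrong initials.

```xml
<dataConfig>
  <!-- The existing dataSource declaration goes here, unchanged. -->
  <script><![CDATA[
    // Pair each LastName with the Initials found at the same position.
    function joinAuthors(row) {
      var lastNames = row.get('author_last');      // hypothetical helper column
      var initials  = row.get('author_initials');  // hypothetical helper column
      if (lastNames != null && initials != null
          && lastNames.size() == initials.size()) {
        var authors = new java.util.ArrayList();
        for (var i = 0; i < lastNames.size(); i++) {
          authors.add(lastNames.get(i) + ' ' + initials.get(i));
        }
        row.put('author', authors);
      }
      // Drop the helper columns so they are not indexed themselves.
      row.remove('author_last');
      row.remove('author_initials');
      return row;
    }
  ]]></script>
  <document>
    <!-- The existing parent entity that supplies
         ${medlineFileList.fileAbsolutePath} wraps the entity below. -->
    <entity name="medlineFiles" processor="XPathEntityProcessor"
            url="${medlineFileList.fileAbsolutePath}"
            forEach="/MedlineCitationSet/MedlineCitation"
            transformer="script:joinAuthors">
      <field column="author_last"
             xpath="/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author/LastName" />
      <field column="author_initials"
             xpath="/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author/Initials" />
    </entity>
  </document>
</dataConfig>
```

Note that the script works on the two child elements separately rather than on the flattened Author text, since the flattened string gives no reliable way to tell ForeName from Initials.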

On 4/30/12 4:49 PM, "Twomey, David" <da...@novartis.com> wrote:

>Sorry hit send too soon.  Continued the email below
>
>On 4/30/12 4:46 PM, "Twomey, David" <da...@novartis.com> wrote:
>
>>
>>Is this possible in DataImportHandler?
>>
>>I want all of the following XML to collapse into one multi-valued Author
>>field
>>
>><AuthorList CompleteYN="Y">
>> <Author ValidYN="Y">
>>  <LastName>Sørlie</LastName>
>>  <ForeName>T</ForeName>
>>  <Initials>T</Initials>
>> </Author>
>> <Author ValidYN="Y">
>>  <LastName>Perou</LastName>
>>  <ForeName>C M</ForeName>
>>  <Initials>CM</Initials>
>> </Author>
>> <Author ValidYN="Y">
>>  <LastName>Tibshirani</LastName>
>>  <ForeName>R</ForeName>
>>  <Initials>R</Initials>
>> </Author>
>>...
>>
>>So my XPATH is like
>>xpath="/MedlineCitationSet/MedlineCitation/AuthorList/??"
>>commonField="true" />
>
>>
>


Re: correct XPATH syntax

Posted by "Twomey, David" <da...@novartis.com>.
Sorry, I hit send too soon.  The email continues below.

On 4/30/12 4:46 PM, "Twomey, David" <da...@novartis.com> wrote:

>
>Is this possible in DataImportHandler?
>
>I want all of the following XML to collapse into one multi-valued Author field
>
><AuthorList CompleteYN="Y">
> <Author ValidYN="Y">
>  <LastName>Sørlie</LastName>
>  <ForeName>T</ForeName>
>  <Initials>T</Initials>
> </Author>
> <Author ValidYN="Y">
>  <LastName>Perou</LastName>
>  <ForeName>C M</ForeName>
>  <Initials>CM</Initials>
> </Author>
> <Author ValidYN="Y">
>  <LastName>Tibshirani</LastName>
>  <ForeName>R</ForeName>
>  <Initials>R</Initials>
> </Author>
>...
>
>So my XPATH is like
>xpath="/MedlineCitationSet/MedlineCitation/AuthorList/??"
>commonField="true" />

>