Posted to solr-user@lucene.apache.org by Jay Hill <ja...@gmail.com> on 2009/07/02 02:01:49 UTC

DIH: Limited xpath syntax unable to parse all xml elements

I'm using the XPathEntityProcessor to parse an xml structure that looks like
this:

<book>
    <author>Joe Smith</author>
    <title>World Atlas</title>
    <body>
        <chapter>
            <p>Content I want is here</p>
            <p>More content I want is here.</p>
            <p>Still more content here.</p>
        </chapter>
    </body>
</book>

The author and title parse out fine:

    <field column="title" xpath="/book/title"/>
    <field column="author" xpath="/book/author"/>

But I can't get at the data inside the <p> tags. I want to get all
non-markup text inside the body tag with something like this:

<field column="body" xpath="/book/body/chapter//p"/>

but that is not supported.

Does anyone know of a way that I can get the content within the <p> tags
without the markup?

Thanks,
-Jay

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Jay Hill <ja...@gmail.com>.

Thanks Fergus, setting the field to multivalued did work:
      <field column="body" xpath="/book/body/chapter/p" flatten="true"/>
gets all the <p> elements as multiple values in the body field.

The only thing is, the body field is used by some other content sources, so
I have to look at the implications setting it to multi-valued will have on
the other data sources. Still, this might do the trick.

Thanks to all that helped on this!

-Jay



On Thu, Jul 2, 2009 at 11:40 AM, Fergus McMenemie <fe...@twig.me.uk> wrote:

> >Shalin Shekhar Mangar wrote:
> >> On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >>
> >>
> >>> It looks like DIH implements its own subset of the Xpath spec.
> >>>
> >>
> >>
> >> Right, DIH has a streaming implementation supporting a subset of XPath
> only.
> >> The supported things are in the wiki examples.
> >>
> >>
> >>
> >>> I don't see any tests with multiple matching sub nodes, so perhaps DIH
> >>> Xpath does not properly support that and just selects the last matching
> >>> node?
> >>>
> >>
> >>
> >> It selects all matching nodes. But if the field is not multi-valued, it
> will
> >> store only the last value. I guess this is what is happening here.
> >>
> >>
> >So do you think it should match them all and add the concatenated text
> >as one field?
> >
> >That would be more Xpath like I think, and less arbitrary than just
> >choosing the last one.
>
> Only when the field in schema.xml is not multiValued. If the field is
> multiValued, it should still behave as at present?
>
> Also... what went wrong with the suggested:-
>     <field column="body" xpath="/book/body/chapter" flatten="true"/>
>
> Regards Fergus.
>

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Fergus McMenemie <fe...@twig.me.uk>.

>Shalin Shekhar Mangar wrote:
>> On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller <ma...@gmail.com> wrote:
>>
>>   
>>> It looks like DIH implements its own subset of the Xpath spec.
>>>     
>>
>>
>> Right, DIH has a streaming implementation supporting a subset of XPath only.
>> The supported things are in the wiki examples.
>>
>>
>>   
>>> I don't see any tests with multiple matching sub nodes, so perhaps DIH
>>> Xpath does not properly support that and just selects the last matching
>>> node?
>>>     
>>
>>
>> It selects all matching nodes. But if the field is not multi-valued, it will
>> store only the last value. I guess this is what is happening here.
>>
>>   
>So do you think it should match them all and add the concatenated text 
>as one field?
>
>That would be more Xpath like I think, and less arbitrary than just 
>choosing the last one.

Only when the field in schema.xml is not multiValued. If the field is
multiValued, it should still behave as at present?

Also... what went wrong with the suggested:-
    <field column="body" xpath="/book/body/chapter" flatten="true"/>

Regards Fergus.

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Jay Hill <ja...@gmail.com>.

I'm on the trunk, built on July 2: 1.4-dev 789506

Thanks,
-Jay

On Thu, Jul 2, 2009 at 11:33 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller <ma...@gmail.com>
> wrote:
>
> > Shalin Shekhar Mangar wrote:
> >
> >>
> >> It selects all matching nodes. But if the field is not multi-valued, it
> >> will
> >> store only the last value. I guess this is what is happening here.
> >>
> >>
> >>
> > So do you think it should match them all and add the concatenated text as
> > one field?
> >
> > That would be more Xpath like I think, and less arbitrary than just
> > choosing the last one.
> >
>
> I won't call it arbitrary because it creates a SolrInputDocument with
> values
> from all the matching nodes just like you'd create any multi-valued field.
> The problem is that his field is not declared to be multi-valued. The same
> would happen if you posted an XML document to /update with multiple values
> for a single-valued field.
>
> XPathEntityProcessor provides the flatten="true" option if you want to add
> it as concatenated text. Jay mentioned that flatten did not work for him
> which is something we should investigate.
>
> Jay, which version of Solr are you running? The flatten option is a 1.4
> feature (added with SOLR-1003).
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Mark Miller <ma...@gmail.com>.

Shalin Shekhar Mangar wrote:
> On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller <ma...@gmail.com> wrote:
>
>   
>> Shalin Shekhar Mangar wrote:
>>
>>     
>>> It selects all matching nodes. But if the field is not multi-valued, it
>>> will
>>> store only the last value. I guess this is what is happening here.
>>>
>>>
>>>
>>>       
>> So do you think it should match them all and add the concatenated text as
>> one field?
>>
>> That would be more Xpath like I think, and less arbitrary than just
>> choosing the last one.
>>
>>     
>
> I won't call it arbitrary because it creates a SolrInputDocument with values
> from all the matching nodes just like you'd create any multi-valued field.
>   
Then shouldn't it throw an error? If your field is not multivalued, but
the XML is multivalued, it does seem arbitrary to pick the last node
when Xpath says to select them all.

It seems it should throw an error (saying to use flatten or a
multivalued field?) or concatenate all the text?

-- 
- Mark

http://www.lucidimagination.com

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller <ma...@gmail.com> wrote:

> Shalin Shekhar Mangar wrote:
>
>>
>> It selects all matching nodes. But if the field is not multi-valued, it
>> will
>> store only the last value. I guess this is what is happening here.
>>
>>
>>
> So do you think it should match them all and add the concatenated text as
> one field?
>
> That would be more Xpath like I think, and less arbitrary than just
> choosing the last one.
>

I won't call it arbitrary because it creates a SolrInputDocument with values
from all the matching nodes just like you'd create any multi-valued field.
The problem is that his field is not declared to be multi-valued. The same
would happen if you posted an XML document to /update with multiple values
for a single-valued field.

XPathEntityProcessor provides the flatten="true" option if you want to add
it as concatenated text. Jay mentioned that flatten did not work for him
which is something we should investigate.
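
To make the three outcomes discussed above concrete, here is a minimal sketch in plain Python. It is not DIH or Solr code: it uses the standard library's ElementTree on the sample document from the original post (with the last </p> closed) and only imitates the behaviour as described in this thread. The variable names are made up for illustration, and the whitespace collapsing in the concatenated case is an assumption of the sketch, not a statement about what flatten="true" does.

```python
import xml.etree.ElementTree as ET

# Sample document from the original post (with the last </p> closed properly).
DOC = """<book>
  <body>
    <chapter>
      <p>Content I want is here</p>
      <p>More content I want is here.</p>
      <p>Still more content here.</p>
    </chapter>
  </body>
</book>"""

root = ET.fromstring(DOC)  # root element is <book>

# Every <p> that matches the path, in document order.
matches = [p.text for p in root.findall("./body/chapter/p")]

# Multi-valued field: all matching values are kept.
multi_valued = list(matches)

# Single-valued field: per the description above, only the last value survives.
single_valued = matches[-1]

# Flatten-style result: all text under <chapter> joined into one string.
# Collapsing whitespace is a choice made for this sketch only.
chapter = root.find("./body/chapter")
concatenated = " ".join("".join(chapter.itertext()).split())

print(multi_valued)
print(single_valued)
print(concatenated)
```

The single-valued case is the one that matches what Jay reported: the path selects all three paragraphs, but only the last one ends up in the field.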

Jay, which version of Solr are you running? The flatten option is a 1.4
feature (added with SOLR-1003).
-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Mark Miller <ma...@gmail.com>.

Shalin Shekhar Mangar wrote:
> On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller <ma...@gmail.com> wrote:
>
>   
>> It looks like DIH implements its own subset of the Xpath spec.
>>     
>
>
> Right, DIH has a streaming implementation supporting a subset of XPath only.
> The supported things are in the wiki examples.
>
>
>   
>> I don't see any tests with multiple matching sub nodes, so perhaps DIH
>> Xpath does not properly support that and just selects the last matching
>> node?
>>     
>
>
> It selects all matching nodes. But if the field is not multi-valued, it will
> store only the last value. I guess this is what is happening here.
>
>   
So do you think it should match them all and add the concatenated text 
as one field?

That would be more Xpath like I think, and less arbitrary than just 
choosing the last one.

-- 
- Mark

http://www.lucidimagination.com

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller <ma...@gmail.com> wrote:

> It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only.
The supported things are in the wiki examples.

> I don't see any tests with multiple matching sub nodes, so perhaps DIH
> Xpath does not properly support that and just selects the last matching
> node?

It selects all matching nodes. But if the field is not multi-valued, it will
store only the last value. I guess this is what is happening here.

-- 
Regards,
Shalin Shekhar Mangar.

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Mark Miller <ma...@gmail.com>.

It looks like DIH implements its own subset of the Xpath spec. I don't 
see any tests with multiple matching sub nodes, so perhaps DIH Xpath 
does not properly support that and just selects the last matching node?

Also, I don't think the double / matters. That would just allow more
nodes in between, but since there are not any in between in your example
document, it's the same as a single /.
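
A rough illustration of the / versus // point, using Python's standard ElementTree rather than DIH (so it says nothing about what the DIH XPath subset accepts). The two sample documents and the helper name `select` are invented for this sketch. Paths are written relative to the <book> root because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# <p> elements are direct children of <chapter>, as in the example document.
FLAT = "<book><body><chapter><p>one</p><p>two</p></chapter></body></book>"

# Hypothetical variant with an extra <section> level around the first <p>.
NESTED = ("<book><body><chapter>"
          "<section><p>one</p></section><p>two</p>"
          "</chapter></body></book>")

def select(xml_text, path):
    """Return the text of every element matched by path, relative to <book>."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.findall(path)]

CHILD = "./body/chapter/p"        # single slash: direct children only
DESCENDANT = "./body/chapter//p"  # double slash: any depth below <chapter>

flat_child = select(FLAT, CHILD)
flat_descendant = select(FLAT, DESCENDANT)
nested_child = select(NESTED, CHILD)
nested_descendant = select(NESTED, DESCENDANT)

print(flat_child, flat_descendant)      # same result when nothing sits in between
print(nested_child, nested_descendant)  # differ once a <section> is in between
```

With the flat document both paths give the same list; the difference only shows up when there is an intermediate element.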

- Mark

Jay Hill wrote:
> It is not multivalued. The intention is to get all text under the <body>
> element into one "body" field in the index that is not multivalued.
> Essentially everything within the <body> element minus the markup.
>
> Thanks,
> -Jay
>
>
> On Thu, Jul 2, 2009 at 8:55 AM, Fergus McMenemie <fe...@twig.me.uk> wrote:
>
>   
>>> Thanks Noble, I gave those examples a try.
>>>
>>> If I use <field column="body" xpath="/book/body/chapter/p" />  I only get
>>> the text from the last <p> element, not from all elements.
>>>       
>> Hmmmmm, I am sure I have done this. In your schema.xml is the
>> field "body" multiValued or not?
>>
>>
>>     
>>> If I use <field column="body" xpath="/book/body/chapter" flatten="true"/>
>>> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I
>>>       
>> don't
>>     
>>> get back anything for the body column.
>>>
>>> So the first example is close, but it only gets the text for the last <p>
>>> element. If I could get all <p> elements at the same level that would be
>>> what I need. The double-slash (/book/body/chapter//p) doesn't seem to be
>>> supported.
>>>
>>> Thanks,
>>> -Jay
>>>
>>>
>>> 2009/7/1 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>
>>>
>>>       
>>>> complete xpath is not supported
>>>>
>>>> /book/body/chapter/p
>>>>
>>>> should work.
>>>>
>>>> if you wish all the text under <chapter> irrespective of nesting , tag
>>>> names use this
>>>> <field column="body" xpath="/book/body/chapter" flatten="true"/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill<ja...@gmail.com> wrote:
>>>>         
>>>>> I'm using the XPathEntityProcessor to parse an xml structure that
>>>>>           
>> looks
>>     
>>>> like
>>>>         
>>>>> this:
>>>>>
>>>>> <book>
>>>>>    <author>Joe Smith</author>
>>>>>    <title>World Atlas</title>
>>>>>    <body>
>>>>>        <chapter>
>>>>>            <p>Content I want is here</p>
>>>>>            <p>More content I want is here.</p>
>>>>>            <p>Still more content here.</p>
>>>>>        </chapter>
>>>>>    </body>
>>>>> </book>
>>>>>
>>>>> The author and title parse out fine:       <field column="title"
>>>>> xpath="/book/title"/>  <field column="author" xpath="/book/author"/>
>>>>>
>>>>> But I can't get at the data inside the <p> tags. I want to get all
>>>>> non-markup text inside the body tag with something like this:
>>>>>
>>>>> <field column="body" xpath="/book/body/chapter//p"/>
>>>>>
>>>>> but that is not supported.
>>>>>
>>>>> Does anyone know of a way that I can get the content within the <p>
>>>>>           
>> tags
>>     
>>>>> without the markup?
>>>>>
>>>>> Thanks,
>>>>> -Jay
>>>>>
>>>>>           
>>>>
>>>> --
>>>> -----------------------------------------------------
>>>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>>>
>>>>         
>> --
>>
>> ===============================================================
>> Fergus McMenemie               Email:fergus@twig.me.uk
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===============================================================
>>
>>     
>
>   




-- 
- Mark

http://www.lucidimagination.com

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Jay Hill <ja...@gmail.com>.

It is not multivalued. The intention is to get all text under the <body>
element into one "body" field in the index that is not multivalued.
Essentially everything within the <body> element minus the markup.

Thanks,
-Jay


On Thu, Jul 2, 2009 at 8:55 AM, Fergus McMenemie <fe...@twig.me.uk> wrote:

> >Thanks Noble, I gave those examples a try.
> >
> >If I use <field column="body" xpath="/book/body/chapter/p" />  I only get
> >the text from the last <p> element, not from all elements.
>
> Hmmmmm, I am sure I have done this. In your schema.xml is the
> field "body" multiValued or not?
>
>
> >
> >If I use <field column="body" xpath="/book/body/chapter" flatten="true"/>
> >or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I
> don't
> >get back anything for the body column.
> >
> >So the first example is close, but it only gets the text for the last <p>
> >element. If I could get all <p> elements at the same level that would be
> >what I need. The double-slash (/book/body/chapter//p) doesn't seem to be
> >supported.
> >
> >Thanks,
> >-Jay
> >
> >
> >2009/7/1 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>
> >
> >> complete xpath is not supported
> >>
> >> /book/body/chapter/p
> >>
> >> should work.
> >>
> >> if you wish all the text under <chapter> irrespective of nesting , tag
> >> names use this
> >> <field column="body" xpath="/book/body/chapter" flatten="true"/>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill<ja...@gmail.com> wrote:
> >> > I'm using the XPathEntityProcessor to parse an xml structure that
> looks
> >> like
> >> > this:
> >> >
> >> > <book>
> >> >    <author>Joe Smith</author>
> >> >    <title>World Atlas</title>
> >> >    <body>
> >> >        <chapter>
> >> >            <p>Content I want is here</p>
> >> >            <p>More content I want is here.</p>
> >> >            <p>Still more content here.</p>
> >> >        </chapter>
> >> >    </body>
> >> > </book>
> >> >
> >> > The author and title parse out fine:       <field column="title"
> >> > xpath="/book/title"/>  <field column="author" xpath="/book/author"/>
> >> >
> >> > But I can't get at the data inside the <p> tags. I want to get all
> >> > non-markup text inside the body tag with something like this:
> >> >
> >> > <field column="body" xpath="/book/body/chapter//p"/>
> >> >
> >> > but that is not supported.
> >> >
> >> > Does anyone know of a way that I can get the content within the <p>
> tags
> >> > without the markup?
> >> >
> >> > Thanks,
> >> > -Jay
> >> >
> >>
> >>
> >>
> >> --
> >> -----------------------------------------------------
> >> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fergus@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Fergus McMenemie <fe...@twig.me.uk>.

>Thanks Noble, I gave those examples a try.
>
>If I use <field column="body" xpath="/book/body/chapter/p" />  I only get
>the text from the last <p> element, not from all elements.

Hmmmmm, I am sure I have done this. In your schema.xml is the
field "body" multiValued or not?


>
>If I use <field column="body" xpath="/book/body/chapter" flatten="true"/>
>or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't
>get back anything for the body column.
>
>So the first example is close, but it only gets the text for the last <p>
>element. If I could get all <p> elements at the same level that would be
>what I need. The double-slash (/book/body/chapter//p) doesn't seem to be
>supported.
>
>Thanks,
>-Jay
>
>
>2009/7/1 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>
>
>> complete xpath is not supported
>>
>> /book/body/chapter/p
>>
>> should work.
>>
>> if you wish all the text under <chapter> irrespective of nesting , tag
>> names use this
>> <field column="body" xpath="/book/body/chapter" flatten="true"/>
>>
>>
>>
>>
>>
>>
>> On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill<ja...@gmail.com> wrote:
>> > I'm using the XPathEntityProcessor to parse an xml structure that looks
>> like
>> > this:
>> >
>> > <book>
>> >    <author>Joe Smith</author>
>> >    <title>World Atlas</title>
>> >    <body>
>> >        <chapter>
>> >            <p>Content I want is here</p>
>> >            <p>More content I want is here.</p>
>> >            <p>Still more content here.</p>
>> >        </chapter>
>> >    </body>
>> > </book>
>> >
>> > The author and title parse out fine:       <field column="title"
>> > xpath="/book/title"/>  <field column="author" xpath="/book/author"/>
>> >
>> > But I can't get at the data inside the <p> tags. I want to get all
>> > non-markup text inside the body tag with something like this:
>> >
>> > <field column="body" xpath="/book/body/chapter//p"/>
>> >
>> > but that is not supported.
>> >
>> > Does anyone know of a way that I can get the content within the <p> tags
>> > without the markup?
>> >
>> > Thanks,
>> > -Jay
>> >
>>
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Jay Hill <ja...@gmail.com>.

Thanks Noble, I gave those examples a try.

If I use <field column="body" xpath="/book/body/chapter/p" />  I only get
the text from the last <p> element, not from all elements.

If I use <field column="body" xpath="/book/body/chapter" flatten="true"/>
or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't
get back anything for the body column.

So the first example is close, but it only gets the text for the last <p>
element. If I could get all <p> elements at the same level that would be
what I need. The double-slash (/book/body/chapter//p) doesn't seem to be
supported.

Thanks,
-Jay


2009/7/1 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>

> complete xpath is not supported
>
> /book/body/chapter/p
>
> should work.
>
> if you wish all the text under <chapter> irrespective of nesting , tag
> names use this
> <field column="body" xpath="/book/body/chapter" flatten="true"/>
>
>
>
>
>
>
> On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill<ja...@gmail.com> wrote:
> > I'm using the XPathEntityProcessor to parse an xml structure that looks
> like
> > this:
> >
> > <book>
> >    <author>Joe Smith</author>
> >    <title>World Atlas</title>
> >    <body>
> >        <chapter>
> >            <p>Content I want is here</p>
> >            <p>More content I want is here.</p>
> >            <p>Still more content here.</p>
> >        </chapter>
> >    </body>
> > </book>
> >
> > The author and title parse out fine:       <field column="title"
> > xpath="/book/title"/>  <field column="author" xpath="/book/author"/>
> >
> > But I can't get at the data inside the <p> tags. I want to get all
> > non-markup text inside the body tag with something like this:
> >
> > <field column="body" xpath="/book/body/chapter//p"/>
> >
> > but that is not supported.
> >
> > Does anyone know of a way that I can get the content within the <p> tags
> > without the markup?
> >
> > Thanks,
> > -Jay
> >
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

complete xpath is not supported

/book/body/chapter/p

should work.

if you wish all the text under <chapter>, irrespective of nesting or tag
names, use this:
<field column="body" xpath="/book/body/chapter" flatten="true"/>






On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill<ja...@gmail.com> wrote:
> I'm using the XPathEntityProcessor to parse an xml structure that looks like
> this:
>
> <book>
>    <author>Joe Smith</author>
>    <title>World Atlas</title>
>    <body>
>        <chapter>
>            <p>Content I want is here</p>
>            <p>More content I want is here.</p>
>            <p>Still more content here.</p>
>        </chapter>
>    </body>
> </book>
>
> The author and title parse out fine:       <field column="title"
> xpath="/book/title"/>  <field column="author" xpath="/book/author"/>
>
> But I can't get at the data inside the <p> tags. I want to get all
> non-markup text inside the body tag with something like this:
>
> <field column="body" xpath="/book/body/chapter//p"/>
>
> but that is not supported.
>
> Does anyone know of a way that I can get the content within the <p> tags
> without the markup?
>
> Thanks,
> -Jay
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DIH: Limited xpath syntax unable to parse all xml elements

Posted by Mark Miller <ma...@gmail.com>.

Hmmm - my very limited understanding of xpath says that /book/body/chapter/p
should work.

Some quick testing with XPath Expression Testbed shows both
/book/body/chapter/p and /book/body/chapter//p selecting the right nodes.

I'm not sure what's up.

Are you actually looking for /book/body/chapter/p/text() ? That would select
the text of the paras rather than the nodes.
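
The nodes-versus-text distinction can be seen with a small Python sketch. Note the assumptions: ElementTree's XPath subset has no text() step, so the sketch selects the <p> elements and then reads their `.text` attribute, which plays the role that text() would play in full XPath. The shortened sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the example document.
root = ET.fromstring(
    "<book><body><chapter><p>first</p><p>second</p></chapter></body></book>"
)

# Selecting the nodes: the result is a list of element objects.
nodes = root.findall("./body/chapter/p")
tags = [n.tag for n in nodes]

# Selecting the text: the character data inside each selected element.
texts = [n.text for n in nodes]

print(tags)
print(texts)
```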

I'm not too familiar with how DIH uses xpath expressions though.

The xpath test site I like to use (not that I have used much xpath) is:
http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm

-- 
- Mark

http://www.lucidimagination.com

On Wed, Jul 1, 2009 at 8:01 PM, Jay Hill <ja...@gmail.com> wrote:

> I'm using the XPathEntityProcessor to parse an xml structure that looks
> like
> this:
>
> <book>
>    <author>Joe Smith</author>
>    <title>World Atlas</title>
>    <body>
>        <chapter>
>            <p>Content I want is here</p>
>            <p>More content I want is here.</p>
>            <p>Still more content here.</p>
>        </chapter>
>    </body>
> </book>
>
> The author and title parse out fine:       <field column="title"
> xpath="/book/title"/>  <field column="author" xpath="/book/author"/>
>
> But I can't get at the data inside the <p> tags. I want to get all
> non-markup text inside the body tag with something like this:
>
> <field column="body" xpath="/book/body/chapter//p"/>
>
> but that is not supported.
>
> Does anyone know of a way that I can get the content within the <p> tags
> without the markup?
>
> Thanks,
> -Jay
>