Posted to solr-user@lucene.apache.org by Dan Davis <da...@gmail.com> on 2014/12/08 22:42:32 UTC

DIH XPathEntityProcessor question

When I have a forEach attribute like the following:


forEach="/medical-topics/medical-topic/health-topic[@language='English']"

And then need to match an attribute of that, is there any alternative to
spelling it all out:

     <field column="url"
            xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>

I suppose I could do "//health-topic/@url" since the document should then
have a single health-topic (as long as I know they don't nest).
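
For context, here is a minimal sketch of where these two attributes would sit in a
DIH data-config.xml. The file path and the entity name are placeholders I made up
for illustration; only the forEach and xpath values come from the question above.

```xml
<dataConfig>
  <!-- FileDataSource reads the XML from the local filesystem -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- url and name are hypothetical; substitute your own file and entity name -->
    <entity name="healthTopic"
            processor="XPathEntityProcessor"
            url="/path/to/medical-topics.xml"
            forEach="/medical-topics/medical-topic/health-topic[@language='English']">
      <!-- The attribute is addressed by repeating the whole forEach path -->
      <field column="url"
             xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
    </entity>
  </document>
</dataConfig>
```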

Re: DIH XPathEntityProcessor question

Posted by Dan Davis <da...@danizen.net>.

Yes, applying the XSLT first worked quite well. I still need the "//tagname"
xpath, but that is the only DIH incantation I need. This will substantially
accelerate things.

On Mon, Dec 8, 2014 at 5:37 PM, Dan Davis <da...@danizen.net> wrote:

> The problem is that XPathEntityProcessor has its own XPath implementation,
> which supports only a subset of XPath. So, if the input document is small
> enough, it makes no sense to fight it. One possibility is to apply an XSLT
> to the file before processing it.
>
> This blog post
> <http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx>
> shows a worked example. The XSL transform takes place before the forEach
> or field specifications are applied, which was the principal question I had
> about it from the documentation. This is also illustrated in the initQuery()
> private method of XPathEntityProcessor, where you can see the transformation
> being applied before the forEach. This will not scale to extremely large
> XML documents with millions of rows; that is why they have the
> stream="true" argument there, so that you don't preprocess the document.
> In my case, the entire XML file is 29M, so I think I can do the XSL
> transformation and then run the forEach over the result.
>
> This potentially shortens my time frame for moving to Apache Solr
> substantially, because the common case with our previous indexer is to run
> an XSLT to transform to the document format desired by the indexer.

Re: DIH XPathEntityProcessor question

Posted by Dan Davis <da...@danizen.net>.

The problem is that XPathEntityProcessor has its own XPath implementation,
which supports only a subset of XPath. So, if the input document is small
enough, it makes no sense to fight it. One possibility is to apply an XSLT
to the file before processing it.

This blog post
<http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx>
shows a worked example. The XSL transform takes place before the forEach
or field specifications are applied, which was the principal question I had
about it from the documentation. This is also illustrated in the initQuery()
private method of XPathEntityProcessor, where you can see the transformation
being applied before the forEach. This will not scale to extremely large
XML documents with millions of rows; that is why they have the
stream="true" argument there, so that you don't preprocess the document.
In my case, the entire XML file is 29M, so I think I can do the XSL
transformation and then run the forEach over the result.

This potentially shortens my time frame for moving to Apache Solr
substantially, because the common case with our previous indexer is to run
an XSLT to transform to the document format desired by the indexer.
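
A minimal sketch of the XSLT-first approach, not the stylesheet actually used
here. It assumes a stylesheet at the hypothetical path xslt/health-topics.xsl,
invented output element names (topics, topic), and that <title> is a direct child
of <health-topic>. The stylesheet keeps only the English topics and turns the url
attribute into a child element, so the DIH paths become short:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Emit one flat <topic> per English health-topic -->
  <xsl:template match="/">
    <topics>
      <xsl:for-each select="/medical-topics/medical-topic/health-topic[@language='English']">
        <topic>
          <!-- The attribute becomes an element, which the XPath subset handles easily -->
          <url><xsl:value-of select="@url"/></url>
          <!-- Assumes title is a direct child of health-topic -->
          <title><xsl:value-of select="title"/></title>
        </topic>
      </xsl:for-each>
    </topics>
  </xsl:template>
</xsl:stylesheet>
```

The entity then points at the stylesheet through the xsl attribute of
XPathEntityProcessor, and forEach and the fields address the transformed
document rather than the original one (entity name and file path are again
placeholders):

```xml
<entity name="healthTopic"
        processor="XPathEntityProcessor"
        url="/path/to/medical-topics.xml"
        xsl="xslt/health-topics.xsl"
        forEach="/topics/topic">
  <!-- Paths refer to the XSLT output, not the source file -->
  <field column="url" xpath="/topics/topic/url"/>
  <field column="title" xpath="/topics/topic/title"/>
</entity>
```

Because the transform runs over the whole file before any rows are produced,
this fits a file of the size mentioned above but not a streaming import.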

On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> I don't believe there are any alternatives. At least I could not get
> anything but the full path to work.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

Re: DIH XPathEntityProcessor question

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I don't believe there are any alternatives. At least I could not get
anything but the full path to work.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 8 December 2014 at 17:01, Dan Davis <da...@gmail.com> wrote:
> In experimentation with a much simpler and smaller XML file, it doesn't
> look like "//health-topic/@url" will work, nor will "//@url" etc. So
> far, only spelling it all out works.
> With child elements, such as <title>, an xpath of "//title" works fine, but
> it is beginning to seem dangerous.
>
> Is there any short-hand for the current node or the match?

Re: DIH XPathEntityProcessor question

Posted by Dan Davis <da...@gmail.com>.

In experimentation with a much simpler and smaller XML file, it doesn't
look like "//health-topic/@url" will work, nor will "//@url" etc. So
far, only spelling it all out works.
With child elements, such as <title>, an xpath of "//title" works fine, but
it is beginning to seem dangerous.

Is there any short-hand for the current node or the match?
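
To make the report concrete, here is a hypothetical small test file of the shape
described. It is a reconstruction for illustration, not the file I actually used,
and the URL and title values are made up:

```xml
<medical-topics>
  <medical-topic>
    <health-topic language="English" url="http://example.org/topic-1">
      <title>Example topic</title>
    </health-topic>
  </medical-topic>
</medical-topics>
```

Against a file like that, the field definitions behave as annotated below. The
comments summarise what was observed in this experiment, not documented
guarantees, and the entity name and file path are placeholders:

```xml
<entity name="healthTopic"
        processor="XPathEntityProcessor"
        url="/path/to/small-test.xml"
        forEach="/medical-topics/medical-topic/health-topic[@language='English']">
  <!-- Observed to work: the fully spelled-out attribute path -->
  <field column="url"
         xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
  <!-- Observed to work, but it matches a title element at any depth -->
  <field column="title" xpath="//title"/>
  <!-- Observed not to work: xpath="//health-topic/@url" and xpath="//@url" -->
</entity>
```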

On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis <da...@gmail.com> wrote:

> When I have a forEach attribute like the following:
>
>
> forEach="/medical-topics/medical-topic/health-topic[@language='English']"
>
> And then need to match an attribute of that, is there any alternative to
> spelling it all out:
>
>      <field column="url"
> xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
>
> I suppose I could do "//health-topic/@url" since the document should then
> have a single health-topic (as long as I know they don't nest).
>
>