Posted to solr-user@lucene.apache.org by Tor Henning Ueland <to...@gmail.com> on 2010/06/07 15:17:56 UTC

Tips on recursive xml-parsing in dataConfig

Hi,

I am testing dataimport into Solr from XML documents with deeply
nested child elements. Parsing children a few levels down with XPath
works, but it is very slow (~1 minute per document on a quad Xeon
server). When I index the same data in the XML format Solr expects
natively, parsing takes 0.02 seconds per document.

I have published a quick example here:
http://pastebin.com/adhcEvRx

My question is:

I hope I have simply done something wrong in the child parsing (as you
can see, it goes down quite a few levels). Can anybody point me in the
right direction so I can speed up the process? I have been looking
around for examples, but have not found any that index data nested
this deeply.

PS: I know there are some bugs in the XPath naming etc., but it is just
a rough example :)

-- 
Best regards
Tor Henning Ueland

Re: Tips on recursive xml-parsing in dataConfig

Posted by Tor Henning Ueland <to...@gmail.com>.
The plan changed and we no longer use those XML files at all. I ended up
using some other data files as sources, which had everything flat, so no
recursion was needed after all. But thanks for the input! :)

Best regards

On Tue, Jun 8, 2010 at 11:08 AM, Geert-Jan Brits <gb...@gmail.com> wrote:
> My bad, it looks like XPathEntityProcessor doesn't support relative XPaths.
>
> However, I quickly looked at the Slashdot example (which is pretty good
> actually) at http://wiki.apache.org/solr/DataImportHandler.
> From that I infer that you use only one entity per XML document, and within
> that entity use multiple field declarations with xpath attributes to extract
> the values you want.
> So even though your XML document is nested (like most XML is), your
> field declarations are not.
>
> I think your best bet is to read the slashdot example and go from there.
>
> For now, I'm not entirely sure what you want a Solr document to be in your
> example, i.e.:
> - 1 Solr document per XML document (as supplied)
> - or 1 Solr document per CHAP, per PARA, or per SUB?
>
> Once you know that, perhaps coming up with a decent pointer is easier.
>
> HTH,
> Geert-Jan



-- 
Mvh
Tor Henning Ueland

Re: Tips on recursive xml-parsing in dataConfig

Posted by Geert-Jan Brits <gb...@gmail.com>.
My bad, it looks like XPathEntityProcessor doesn't support relative XPaths.

However, I quickly looked at the Slashdot example (which is pretty good
actually) at http://wiki.apache.org/solr/DataImportHandler.
From that I infer that you use only one entity per XML document, and within
that entity use multiple field declarations with xpath attributes to extract
the values you want.
So even though your XML document is nested (like most XML is), your
field declarations are not.

I think your best bet is to read the slashdot example and go from there.
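
A rough sketch of that flat layout, not copied from the wiki page or from
the pastebin: a single entity reads the file and every field carries an
absolute xpath. Only the /DOK/TEKST/KAP path comes from this thread; the
file path, the column names, and the TITTEL, PARA and id names are assumed
placeholders.

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- One entity for the whole file. forEach marks the element that
         becomes one Solr document (here: the assumed root element DOK). -->
    <entity name="dok"
            processor="XPathEntityProcessor"
            url="/path/to/document.xml"
            forEach="/DOK">
      <!-- Every xpath is absolute, however deep the target element sits.
           Column and element names below are hypothetical. -->
      <field column="id"            xpath="/DOK/@id"/>
      <field column="title"         xpath="/DOK/TITTEL"/>
      <field column="chapter_title" xpath="/DOK/TEKST/KAP/TITTEL"/>
      <field column="paragraph"     xpath="/DOK/TEKST/KAP/PARA"/>
    </entity>
  </document>
</dataConfig>
```

With this shape the repeated elements (chapters, paragraphs) end up as
several values in one field, so the matching schema fields would need to be
multiValued.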

For now, I'm not entirely sure what you want a Solr document to be in your
example, i.e.:
- 1 Solr document per XML document (as supplied)
- or 1 Solr document per CHAP, per PARA, or per SUB?

Once you know that, perhaps coming up with a decent pointer is easier.
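
To illustrate why that choice matters: in XPathEntityProcessor the forEach
attribute decides which element becomes one Solr document. A sketch of the
per-chapter variant, assuming the chapter element is /DOK/TEKST/KAP (the
path quoted earlier in this thread); the file path, the column names, and
the TITTEL and PARA element names are hypothetical.

```xml
<!-- Goes inside <dataConfig><document> ... </document></dataConfig>,
     next to a <dataSource type="FileDataSource"/> declaration. -->
<entity name="kap"
        processor="XPathEntityProcessor"
        url="/path/to/document.xml"
        forEach="/DOK/TEKST/KAP">
  <!-- Each KAP element now yields its own Solr document. The field xpaths
       stay absolute and point inside the forEach element. -->
  <field column="chapter_title" xpath="/DOK/TEKST/KAP/TITTEL"/>
  <field column="paragraph"     xpath="/DOK/TEKST/KAP/PARA"/>
</entity>
```

Changing forEach to "/DOK" would instead give one Solr document per XML
document, with all chapters collected into multi-valued fields.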

HTH,
Geert-Jan


2010/6/8 Tor Henning Ueland <to...@gmail.com>

> I have tried both changing the datasource per child node to use the
> parent node's name and making the XPaths relative. Both cause
> either exceptions saying that the XPath must start with /, or
> null pointer exceptions (nsfgrantsdir document : null).
>
> Best regards

Re: Tips on recursive xml-parsing in dataConfig

Posted by Tor Henning Ueland <to...@gmail.com>.
I have tried both changing the datasource per child node to use the
parent node's name and making the XPaths relative. Both cause
either exceptions saying that the XPath must start with /, or
null pointer exceptions (nsfgrantsdir document : null).

Best regards

On Mon, Jun 7, 2010 at 4:12 PM, Geert-Jan Brits <gb...@gmail.com> wrote:
> I'm guessing (I'm not familiar with the XML dataimport handler, but I am
> pretty familiar with XPath) that your problem lies in using absolute
> XPath queries instead of XPath queries relative to your parent node.
>
> E.g., /DOK/TEKST/KAP is absolute (the leading '/' makes it so). Try
> 'KAP' instead.
> The same goes for all XPaths deeper in the tree.
>
> Geert-Jan



-- 
Mvh
Tor Henning Ueland

Re: Tips on recursive xml-parsing in dataConfig

Posted by Geert-Jan Brits <gb...@gmail.com>.
I'm guessing (I'm not familiar with the XML dataimport handler, but I am
pretty familiar with XPath) that your problem lies in using absolute
XPath queries instead of XPath queries relative to your parent node.

E.g., /DOK/TEKST/KAP is absolute (the leading '/' makes it so). Try
'KAP' instead.
The same goes for all XPaths deeper in the tree.

Geert-Jan

2010/6/7 Tor Henning Ueland <to...@gmail.com>

> Hi,
>
> I am testing dataimport into Solr from XML documents with deeply
> nested child elements. Parsing children a few levels down with XPath
> works, but it is very slow (~1 minute per document on a quad Xeon
> server). When I index the same data in the XML format Solr expects
> natively, parsing takes 0.02 seconds per document.
>
> I have published a quick example here:
> http://pastebin.com/adhcEvRx
>
> My question is:
>
> I hope I have simply done something wrong in the child parsing (as you
> can see, it goes down quite a few levels). Can anybody point me in the
> right direction so I can speed up the process? I have been looking
> around for examples, but have not found any that index data nested
> this deeply.
>
> PS: I know there are some bugs in the XPath naming etc., but it is just
> a rough example :)
>
> --
> Best regards
> Tor Henning Ueland
>