Posted to solr-user@lucene.apache.org by Stan Lee <sl...@gmail.com> on 2016/08/16 13:15:22 UTC

What's the best practices for indexing XML Content with dynamic XML Elements (SOLR 6.1) ?

We currently have a Microsoft SQL Server table with a column of the XML
datatype. We use DIH to import the XML content as is, that is, without
using the XPathEntityProcessor. XPathEntityProcessor would make sense if
the elements of the XML content were known in advance, but ours are
dynamic. Could someone kindly suggest the right way of handling such a
scenario without impacting search performance?
Which tokenizer should we be using?


Thanks.

Re: What's the best practices for indexing XML Content with dynamic XML Elements (SOLR 6.1) ?

Posted by Stan Lee <sl...@gmail.com>.

Sorry for not being specific. I believe this SOLR plugin (LUX) may fit my
scenario (query without knowing the tag in advance).
http://luxdb.org/README.html

On Tue, Aug 16, 2016 at 12:18 PM, Erick Erickson <er...@gmail.com>
wrote:

> You haven't really described the scenario you want
> to implement. I get that you have raw XML of an
> unknown structure. What do you want to _do_ with that?
>
> 1> If all you want to do is index the data (i.e. strip the tags),
> try HTMLStripCharFilterFactory.
> 2> If you want to intelligently take content of the XML
> and ingest it into specific Solr fields, I don't think you'll be
> able to do that without writing some specific code to
> parse the XML, explore it and "do the right thing" with it
> which will probably involve SolrJ, an XML parser and
> some programming.
>
> Best,
> Erick

Re: What's the best practices for indexing XML Content with dynamic XML Elements (SOLR 6.1) ?

Posted by Erick Erickson <er...@gmail.com>.

You haven't really described the scenario you want
to implement. I get that you have raw XML of an
unknown structure. What do you want to _do_ with that?

1> If all you want to do is index the data (i.e. strip the tags),
try HTMLStripCharFilterFactory.
2> If you want to intelligently take content from the XML
and ingest it into specific Solr fields, I don't think you'll be
able to do that without writing some specific code to
parse the XML, explore it and "do the right thing" with it,
which will probably involve SolrJ, an XML parser and
some programming.
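
For option 2, a minimal sketch of the "parse the XML and explore it" step
might look like the following. It uses only the JDK's DOM parser and maps
every leaf element to a field named after its tag plus a "_txt" suffix. The
class name XmlFieldExtractor, the "_txt" suffix and the assumption that the
schema has a matching multiValued "*_txt" dynamic field are all illustrative
choices, not something taken from this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns XML of unknown structure into
// field-name -> values pairs suitable for dynamic fields.
class XmlFieldExtractor {

    // Parses the XML string and collects the text of every leaf element
    // under the key "<tagName>_txt" (assumed dynamic-field naming).
    static Map<String, List<String>> extract(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    // Depth-first walk; only elements without child elements contribute values.
    private static void collect(Element element, Map<String, List<String>> fields) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, fields);
            }
        }
        if (!hasChildElement) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                fields.computeIfAbsent(element.getTagName() + "_txt",
                        k -> new ArrayList<>()).add(text);
            }
        }
    }
}
```

With SolrJ, each entry of the returned map could then be copied into a
SolrInputDocument via addField(name, value) before sending the document to
Solr. Whether flattening by tag name is appropriate depends on how the
content needs to be queried.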

Best,
Erick
