Posted to solr-user@lucene.apache.org by "Jana, Kumar Raja" <kj...@ptc.com> on 2008/12/22 13:19:33 UTC

Scoped searches in XML documents

Hi,

 

I want to perform scoped searches in XML documents using Solr. I am
using Solr-Cell to index my document files. I've noticed that when I
index an xml file to Solr (via Solr-Cell) the field tags get stripped
off and only the values are sent to Solr.

For example, say I have an XML document which contains the following data:

<test>
    <node1>
        <inner_node1>XYZ</inner_node1>
        <inner_node2>ABC</inner_node2>
        <sometag>PPPP</sometag>
    </node1>
    <node1>
        ....
    </node1>
</test>

When I index this XML file, only the field values (XYZ, ABC and PPPP)
seem to reach Solr; the element tags are stripped off. (Digging a bit
further, this appears to be what Apache Tika does.)

 

Is there any setting or feature which would enable me to preserve the
field/tag information and hence allow me to perform scoped searches
using Solr?

 

To clarify what I mean by "scoped search": once the XML document above
is indexed, a scoped search would let me find all occurrences of ABC
within the <inner_node2> XML tag.

-Kumar


RE: Scoped searches in XML documents

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
It sounds like you haven't yet looked at the way Solr handles fields. I
assume that Solr-Cell (which I haven't looked at yet but hope to soon)
indexes everything into a single field. When using Solr on its own, the
first thing you do is create a schema that specifies the fields you want
in your index; you then massage your xml into the form Solr expects. In
your example you would end up with input documents something like

<doc>
	<field name="inner_node1">XYZ</field>
	<field name="inner_node2">ABC</field>
	<field name="sometag">PPPP</field>
</doc>

(That applies to updating the index by posting xml to Solr; there are
many other mechanisms for populating the index now, but the basic ideas
of specifying fields remain the same).
The wiki page on Solr schemas (http://wiki.apache.org/solr/SchemaXml)
and the sample schema linked there will make it clear how to specify
your fields. 

You will then be able to specify fields in your queries like
"sometag:PPPP".  
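
For the original question, that would mean a query such as
inner_node2:ABC. A small Python sketch of building the request URL
follows; the base URL (localhost, port 8983, /solr/select) is an
assumption based on a default example install, and fielded_query_url is
a name made up for this example.

```python
from urllib.parse import urlencode


def fielded_query_url(field, value, base="http://localhost:8983/solr/select"):
    """Return a URL that searches for `value` only within `field`.

    The base URL is an assumed default; adjust it for your install.
    """
    # Quote the value as a phrase, escaping backslashes and double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    query = f'{field}:"{escaped}"'
    return base + "?" + urlencode({"q": query})


print(fielded_query_url("inner_node2", "ABC"))
```

Only the q parameter is set here; anything else (rows, fl and so on)
would be added to the same dictionary.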

Now you'll need to figure out how this underlying Solr functionality is
exposed by Solr-Cell, but I hope this is a start.

Peter



Re: Scoped searches in XML documents

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
I don't know much about Solr Cell, but if you can see each node's content
in a different field in Solr, then you should be able to query it by field too.




-- 
Regards,
Shalin Shekhar Mangar.

RE: Scoped searches in XML documents

Posted by "Jana, Kumar Raja" <kj...@ptc.com>.
Hi Shalin,

Thanks for the quick response. I've found my mistake: a setting in my
application was stripping off the XML tags before the documents were
sent to Solr-Cell. I am now able to index the document with the XML
tags. Sorry for being so hasty.

So the only question left is, will I be able to perform scoped searches
using Solr? Is this already implemented in Solr or is there a
workaround?

Thanks
Kumar



Re: Scoped searches in XML documents

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
If your XML documents follow a fixed schema, you may want to look at
DataImportHandler with XPathEntityProcessor:

http://wiki.apache.org/solr/DataImportHandler


-- 
Regards,
Shalin Shekhar Mangar.