Posted to solr-user@lucene.apache.org by "Thung, Peter C CIV SPAWARSYSCEN-PACIFIC, 56340" <pe...@navy.mil> on 2009/10/01 11:40:55 UTC

Question on modifying solr behavior on indexing xml files..

1.  While experimenting with
sending in an XML document inside an XML CDATA section,
with termVectors="true" on the field,
 
I noticed the following behavior:
<person>peter</person>
collapses to the term
personpeterperson
instead of
person
and 
peter separately.
 
I realize I could search for characters like
<>"= and replace them with a space so that the default parser/indexer
preserves element names.
However, I'm wondering if someone could point me to where one might do
this within
the Solr or Apache Lucene code as a proper plug-in, with maybe an example
that I could use
as a template.  Also, what in the solrconfig.xml file would I need to
change to reference the new parser?
 
2.  My other question is whether this technique would work for XML-type
messages embedded
in Microsoft Excel or PowerPoint documents, where I would like to
preserve the XML element term frequencies
while leveraging the component that automatically indexes
Microsoft documents.
Would I need to modify and customize that component?
 
-Peter

Re: Question on modifying solr behavior on indexing xml files..

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Thu, Oct 1, 2009 at 3:10 PM, Thung, Peter C CIV SPAWARSYSCEN-PACIFIC,
56340 <pe...@navy.mil> wrote:

> 1.  While experimenting with
> sending in an XML document inside an XML CDATA section,
> with termVectors="true" on the field,
>
> I noticed the following behavior:
> <person>peter</person>
> collapses to the term
> personpeterperson
> instead of
> person
> and
> peter separately.
>
> I realize I could search for characters like
> <>"= and replace them with a space so that the default parser/indexer
> preserves element names.
> However, I'm wondering if someone could point me to where one might do
> this within
> the Solr or Apache Lucene code as a proper plug-in, with maybe an example
> that I could use
> as a template.  Also, what in the solrconfig.xml file would I need to
> change to reference the new parser?
>
>
Solr is agnostic to the content of a schema field. It does not know that the
content is XML, so it blindly applies the tokenization/filtering defined for
the field type in schema.xml.

If all you want is to do a full-text search on words found somewhere in that
XML, then your approach of replacing <>"= with a space will work fine. You can
use the PatternReplaceFilter and specify a regex which matches these special
characters and replaces them with a space.

<filter class="solr.PatternReplaceFilterFactory" pattern="([&lt;&gt;=&quot;])"
replacement=" " replace="all"/>
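
As a rough illustration of what that replacement does to the sample text,
here is a minimal Python sketch (not Solr code). It assumes the replacement
is applied to the raw field text before tokenization and that tokens are then
split on whitespace; the function names are made up for this example. A token
filter such as PatternReplaceFilter only sees tokens the tokenizer has already
produced, so the real result depends on the tokenizer configured for the
field type.

```python
import re

# Same character class as the filter pattern above: < > = "
XML_SYMBOLS = re.compile(r'[<>="]')


def replace_xml_symbols(text):
    """Replace each XML symbol with a single space (hypothetical helper)."""
    return XML_SYMBOLS.sub(" ", text)


def whitespace_tokens(text):
    """Split on runs of whitespace, a stand-in for a whitespace tokenizer."""
    return replace_xml_symbols(text).split()


if __name__ == "__main__":
    print(whitespace_tokens("<person>peter</person>"))
```

With plain whitespace splitting the closing tag keeps its slash, so the tokens
are person, peter and /person. If that matters, the slash could be added to
the character class, or left to a tokenizer that drops punctuation.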

Or you can use the MappingCharFilter (a Solr 1.4 feature) and specify a
mapping file which maps these special characters to a space.

<charFilter class="solr.MappingCharFilterFactory"
mapping="special-xml-symbols.txt"/>

The file should contain one mapping per line, with both sides in double quotes:
"characterToBeReplaced" => "replacementChar"

However, if you want to preserve the structure of the XML document, it is
best to parse it yourself and put the contents into separate Solr fields
before sending it to Solr. You may also want to look at DataImportHandler and
XPathEntityProcessor, which are commonly used for importing XML files.

http://wiki.apache.org/solr/DataImportHandler
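
To make the "parse it yourself" option concrete, here is a minimal Python
sketch using only the standard library. It turns each element that has text
into a Solr field named after the element and wraps the result in an
<add><doc> update message. The function name, the "id" field and the
one-field-per-element mapping are assumptions for illustration; the field
names would have to exist in your schema.xml (or match a dynamic field).

```python
import xml.etree.ElementTree as ET


def xml_to_solr_add(source_xml, doc_id):
    """Build a Solr <add> message with one field per non-empty element.

    Hypothetical helper: assumes element names are usable as field names
    and that the schema has an "id" field.
    """
    root = ET.fromstring(source_xml)

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id

    # Walk every element (including the root) in document order.
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text:  # skip pure container elements
            ET.SubElement(doc, "field", name=elem.tag).text = text

    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = "<message><person>peter</person><city>San Diego</city></message>"
    print(xml_to_solr_add(sample, "msg-1"))
```

The resulting string can be posted to the Solr update handler. Keeping each
element in its own field means element names and values are indexed
separately instead of relying on character replacement in one text field.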

> 2.  My other question is whether this technique would work for XML-type
> messages embedded
> in Microsoft Excel or PowerPoint documents, where I would like to
> preserve the XML element term frequencies
> while leveraging the component that automatically indexes
> Microsoft documents.
> Would I need to modify and customize that component?
>
>
Perhaps somebody who knows about Solr Cell can answer this but I think it
should work.

-- 
Regards,
Shalin Shekhar Mangar.