Posted to solr-user@lucene.apache.org by "Thung, Peter C CIV SPAWARSYSCEN-PACIFIC, 56340" <pe...@navy.mil> on 2009/09/28 11:12:09 UTC

Question on trying to Index an XML document...

With a basically default install of the trunk version of Solr 1.4,
when I try to index an XML file, it appears that the XML tags
get stripped when indexed.
 
If the tag names and their frequencies are important to me for search
purposes, could someone tell me what
my options are to keep Solr from stripping out XML tags?
For example,
 
if I have an XML tag of
<tag1> hello </tag1>
I'd like to see tag1 appear twice as a term and count as 2 in some
term frequency vector.
 
I was trying out the examples from this link
http://wiki.apache.org/solr/ExtractingRequestHandler
 
and sending in an XML file.
 
Would I need to modify some existing code, or is it just a configuration change
to not strip out XML tags in processing?
 
-Peter

******************************************************************
Peter Thung
Software Developer
IBS Project Technical Lead - Web Developer

Code 56340 - Net-centric ISR Development Branch
Joint & National ISR Systems Division
Intelligence, Surveillance and Reconnaissance Department
US Navy Space & Naval Warfare Systems Center Pacific (SSC PAC)
Topside Campus, Bldg A33, room 0055
53560 Hull Street, San Diego, CA 92152

UNCLASS Email: peter.thung@navy.mil
SIPRNET Email: thungp@spawar.navy.smil.mil
COMM (Primary): (619) 553-6513
COMM (Secondary): (619) 553-0777
FAX: (619) 553-1586
******************************************************************

Re: Question on trying to Index an XML document...

Posted by Lance Norskog <go...@gmail.com>.
Another way to index XML data is to use the normal Solr XML updater
and wrap your XML documents inside CDATA blocks.
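
As a rough illustration of the CDATA approach, here is a minimal Python sketch that builds an XML update message with the raw document wrapped in a CDATA section. The field names "id" and "text" and the helper names are assumptions for illustration, not taken from any particular schema. The resulting string would then be posted to the update handler of your install. Whether tag1 ends up as a term twice still depends on the analyzer configured for that field, so treat this as a starting point rather than a verified recipe.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_in_cdata(text):
    """Wrap text in a CDATA section.

    A literal ']]>' would end the section early, so it is split
    across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_update_message(doc_id, raw_xml, field_name="text"):
    """Build an <add> message whose field value is the raw XML, tags included.

    'id' and the default field name 'text' are assumed field names;
    use whatever your schema actually defines.
    """
    return "".join([
        "<add><doc>",
        '<field name="id">', escape(doc_id), "</field>",
        "<field name=", quoteattr(field_name), ">",
        wrap_in_cdata(raw_xml),
        "</field>",
        "</doc></add>",
    ])


if __name__ == "__main__":
    print(build_update_message("doc1", "<tag1> hello </tag1>"))
```

The point of the CDATA wrapper is that the XML parser on the receiving side hands the field value through as plain character data, so the angle brackets and tag names reach the analysis chain instead of being interpreted as markup.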

On Mon, Sep 28, 2009 at 2:12 AM, Thung, Peter C CIV
SPAWARSYSCEN-PACIFIC, 56340 <pe...@navy.mil> wrote:
> With a basically default install of the trunk version of Solr 1.4,
> when I try to index an XML file, it appears that the XML tags
> get stripped when indexed.
>
> If the tag names and their frequencies are important to me for search
> purposes, could someone tell me what
> my options are to keep Solr from stripping out XML tags?
> For example,
>
> if I have an XML tag of
> <tag1> hello </tag1>
> I'd like to see tag1 appear twice as a term and count as 2 in some
> term frequency vector.
>
> I was trying out the examples from this link
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> and sending in an XML file.
>
> Would I need to modify some existing code, or is it just a configuration change
> to not strip out XML tags in processing?
>
> -Peter

-- 
Lance Norskog
goksron@gmail.com