Posted to solr-user@lucene.apache.org by David Neubert <de...@yahoo.com> on 2007/11/08 05:18:25 UTC

What is the best way to index xml data preserving the mark up?

I am sure this is a 101 question, but I am a bit confused about indexing XML data using SOLR.

I have rich XML content (books) that needs to be searched at granular levels (specifically at the paragraph and sentence levels, very accurately, with no approximations).  My source text has exact <p></p> and <s></s> tags for this purpose.  I have built this app before (using other search engines) by indexing the text twice: (1) with every paragraph as a virtual document and (2) with every sentence as a virtual document -- both extracted from the source file (which was a single XML file for the entire book).  I have of course thought about using an XML engine such as eXist or Xindice, but I prefer the stability, user base, and performance that Lucene/SOLR seems to have, and there is also a large body of text that consists of regular documents rather than well-formed XML.

I am brand new to SOLR (one day) and, at a basic level, understand SOLR's nice, simple XML format for adding documents:

<add>
  <doc>
    <field name="foo1">foo value 1</field>
    <field name="foo2">foo value 2</field>
  </doc>
  <doc>...</doc>
</add>

But my problem is that I believe I need to preserve the XML markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source XML for the paragraph or sentence respectively.  There are reasons for this that I won't go into -- a lot of granular work in this app involves accessing paragraphs and sentences.

Obviously an XML mechanism that could leverage the XML structure (via XPath or XPointer) would work great.  Still, I think Lucene can do this at the field level -- and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing.  Hopefully I am missing something basic?

It would be great to be pointed in the right direction on this matter.

I think I need something along this line:

<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>
    ....
    <field name="content"><an xml stream with embedded source markup></field>
  </doc>
</add>

Maybe the overall question is: what is the best way to index XML content using SOLR, and is all this tag stripping really necessary?

Thanks for any help,

Dave






Re: What is the best way to index xml data preserving the mark up?

Posted by "Hausherr, Jens" <je...@logicacmg.com>.

Hi, 

If you just need to preserve the XML for storage, you could simply wrap the XML markup in a CDATA section. Splitting your structure beforehand and using dynamic fields might also be a viable solution...

E.g.:
<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>
    ....
    <field name="content"><![CDATA[<an xml stream with embedded source markup>]]></field>
  </doc>
</add>
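
As a rough illustration of the same idea from the client side, here is a minimal Python sketch that builds an <add> message with the standard-library ElementTree. It escapes the embedded markup as entities rather than wrapping it in CDATA; an XML parser reads both forms as the same character data. The helper name build_add_message and the field names "id" and "content" are made up for the example and are not part of any Solr schema discussed here.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build a Solr-style <add> message from a list of {field name: value} dicts.

    Hypothetical helper for illustration only.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            # Assigning to .text means any markup characters in the value
            # are escaped (&lt; &gt; &amp;) when the tree is serialized.
            field.text = value
    return ET.tostring(add, encoding="unicode")


# A paragraph with its source markup kept as the field value.
paragraph_xml = "<p><s>First sentence.</s> <s>Second sentence.</s></p>"
message = build_add_message([{"id": "book1-p1", "content": paragraph_xml}])

# Parsing the message back returns the original markup string unchanged.
parsed = ET.fromstring(message)
recovered = parsed.find("doc/field[@name='content']").text
```

This only covers getting the markup into the field intact; how that field is analyzed for searching is a separate schema question.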


 

Best regards,

 
Jens Hausherr 
 

Re: What is the best way to index xml data preserving the mark up?

Posted by Norberto Meijome <fr...@meijome.net>.

A crazy/silly idea, maybe: could you use dynamic fields, each containing a sentence, with the field name carrying a reference to the paragraph it belongs to?
E.g. (not sure if the syntax is correct):

<dynamicField name="s_*" type="string" />

Then, when you create your document, you can define:
<doc>
  <field name="s_1_p1">{Sentence #1, Para#1}</field>
  <field name="s_2_p1">{Sentence #2, Para#1}</field>
  <field name="s_3_p1">{Sentence #3, Para#1}</field>
  <field name="s_1_p2">{Sentence #1, Para#2}</field>
[...]
</doc>

I have no idea how scalable that would be. 
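
To make the naming scheme concrete, here is a small Python sketch (an assumption about how one might generate those fields, not something tested against Solr). It walks a book that uses the <p> and <s> tags from the original question and produces field names of the form s_<sentence>_p<paragraph>. The function name sentence_fields and the <book> root element are invented for the example.

```python
import xml.etree.ElementTree as ET


def sentence_fields(book_xml):
    """Map each sentence to a field name like s_2_p1 (sentence 2, paragraph 1).

    Hypothetical helper for illustration; numbering starts at 1.
    """
    root = ET.fromstring(book_xml)
    fields = {}
    for p_num, paragraph in enumerate(root.iter("p"), start=1):
        for s_num, sentence in enumerate(paragraph.iter("s"), start=1):
            # itertext() gathers the sentence text, ignoring any inline tags.
            fields[f"s_{s_num}_p{p_num}"] = "".join(sentence.itertext())
    return fields


# The <book> wrapper is an assumed root element for this sample.
book_xml = "<book><p><s>One.</s> <s>Two.</s></p><p><s>Three.</s></p></book>"
fields = sentence_fields(book_xml)
```

Each key/value pair in fields would then become one <field name="..."> element in the <doc>.
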
cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome


Re: What is the best way to index xml data preserving the mark up?

Posted by Walter Underwood <wu...@netflix.com>.

If you really, really need to preserve the XML structure, you'll
be doing a LOT of work to make Solr do that. It might be cheaper
to start with software that already does that. I recommend
MarkLogic -- I know the principals there, and it is some seriously
fine software. Not free or open, but very, very good.

If your problem can be expressed in a flat field model, then
your problem is mapping your document model into Solr. You might
be able to use structured field names to represent the XML context,
but that is just a guess.

With a mixed corpus of XML and arbitrary text, requiring special
handling of XML, yow, that's a lot of work.

One thought -- you can do flat fields in an XML engine (like MarkLogic)
much more easily than you can do XML in a flat field engine (like Lucene).

wunder
