Posted to solr-user@lucene.apache.org by Judioo <co...@judioo.com> on 2011/05/18 22:19:27 UTC

Storing, indexing and searching XML documents in Solr

Hi,
I'm new to Solr, so apologies if the solution is already documented.
I have installed Solr and populated an index, using the examples as a
template, with a version of the data below.

I have XML in the form of

  <entity>
    <resource>
      <guid>123898-2092099098982</guid>
      <media_format>Blu-Ray</media_format>
      <updated>2011-05-05T11:25:35+0500</updated>
    </resource>
    <price currency="usd">3.99</price>
    <discounts>
      <discount type="percentage" rate="30" start="2011-05-03T00:00:00" end="2011-05-10T00:00:00" />

      <discount type="decimal" amount="1.99" coupon="1" />
      .....
    </discounts>
    <aspect_ratio>16:9</aspect_ratio>
    <duration>1620</duration>
    <categories>
          <category id="drama" />
          <category id="horror" />
    </categories>
    <rating>
      <rate id="D1">contains some scenes which some viewers may find upsetting</rate>
    </rating>
    .......
    <media_type>Video</media_type>
</entity>


Can I populate Solr directly with this document (as I believe MarkLogic
does)?

If yes:
Can I search on any element or attribute (e.g. find all records where
/entity/resource/media_format equals "blu-ray")?

If no:
What is the best practice for importing the data above into Solr (i.e.
patterns for subdividing / flattening the document)?
Does Solr support attached documents and, if so, is this advised (how does
it affect performance)?

Any help is greatly appreciated. Pointers to documentation that addresses my
issues would be even more helpful.

Thanks again


OJ

Re: Storing, indexing and searching XML documents in Solr

Posted by Judioo <co...@judioo.com>.
The data is being imported directly from mysql. The document is however
indeed a good starting place.
Thanks

2011/5/18 Yury Kats <yu...@yahoo.com>

> On 5/18/2011 4:19 PM, Judioo wrote:
>
> > Any help is greatly appreciated. Pointers to documentation that address my
> > issues is even more helpful.
>
> I think this would be a good start:
>
> http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
>

Re: Storing, indexing and searching XML documents in Solr

Posted by Erick Erickson <er...@gmail.com>.
You're right, you can't store an XML document directly in Solr.
You have to pull it apart and index it in such a way that you can get back
whatever information you need.

How you flatten data depends entirely upon your needs. The high-level
idea is that you want to create fields such that text searches work. The
moment you start thinking about "how can I express a relationship
in the query", back up and try to flatten the data so you can just *search*.

This is vague, I know. But so much depends on how you want to use
the data that specifics are hard to give.

You've gotta take off your DB hat and not worry about duplicating
data. De-normalize lots and lots and lots first...
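
A minimal sketch of that kind of flattening, in Python with only the
standard library. The field names (id, media_format, price_currency,
category, discount_type) are made up for illustration and would have to
match whatever your schema.xml defines; the sample is a trimmed copy of
the entity from the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <entity> from the original post.
SAMPLE = """<entity>
  <resource>
    <guid>123898-2092099098982</guid>
    <media_format>Blu-Ray</media_format>
  </resource>
  <price currency="usd">3.99</price>
  <discounts>
    <discount type="percentage" rate="30" />
    <discount type="decimal" amount="1.99" coupon="1" />
  </discounts>
  <categories>
    <category id="drama" />
    <category id="horror" />
  </categories>
  <media_type>Video</media_type>
</entity>"""


def flatten_entity(xml_text):
    """Turn one <entity> document into a flat dict of field -> value(s).

    All field names here are hypothetical, chosen for illustration.
    """
    root = ET.fromstring(xml_text)
    price = root.find("price")
    return {
        "id": root.findtext("resource/guid"),
        "media_format": root.findtext("resource/media_format"),
        "media_type": root.findtext("media_type"),
        "price": float(price.text),
        "price_currency": price.get("currency"),
        # Repeated elements collapse into one multi-valued field each.
        "category": [c.get("id") for c in root.findall("categories/category")],
        "discount_type": [d.get("type") for d in root.findall("discounts/discount")],
    }


doc = flatten_entity(SAMPLE)
print(doc)
```

With fields like these, the original question becomes a plain field query
such as media_format:"Blu-Ray". Note that the flat lists lose the pairing
between a discount's type and its rate; if a search needs that pairing,
one option is to fold both into a single value before indexing.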

Best
Erick

On Wed, May 18, 2011 at 5:27 PM, Judioo <co...@judioo.com> wrote:
> Great document. I can see how to import the data direct from the database.
> However it seems as though I need to write xpath's in the config to extract
> the fields that I wish to transform into an solr document.
>
> So it seems that there is no way of storing the document structure in solr
> as is?
>
>
> 2011/5/18 Yury Kats <yu...@yahoo.com>
>
>> On 5/18/2011 4:19 PM, Judioo wrote:
>>
>> > Any help is greatly appreciated. Pointers to documentation that address my
>> > issues is even more helpful.
>>
>> I think this would be a good start:
>>
>> http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
>>
>

Re: Storing, indexing and searching XML documents in Solr

Posted by Mike Sokolov <so...@ifactory.com>.
You might want to create a field that's analyzed using
HTMLStripCharFilter - this will index all the non-tag/non-attribute text
in the document and, if you store the value, will store the entire XML
document as well.
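
To make the trade-off concrete, here is a rough Python approximation of
what such a field ends up with. This is not the Solr filter itself, just
a standard-library stand-in: the stored value is the XML as given, and the
indexed text is only the character data between tags. The names text_only,
stored_value and indexed_text are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <entity> from the original post.
SAMPLE = """<entity>
  <resource>
    <media_format>Blu-Ray</media_format>
  </resource>
  <price currency="usd">3.99</price>
  <categories>
    <category id="drama" />
  </categories>
  <media_type>Video</media_type>
</entity>"""


def text_only(xml_text):
    """Return the text between tags, with whitespace collapsed.

    Element names and attribute values are dropped, which roughly mimics
    what a tag-stripping char filter hands to the tokenizer.
    """
    root = ET.fromstring(xml_text)
    return " ".join(" ".join(root.itertext()).split())


stored_value = SAMPLE              # what a stored field would hand back
indexed_text = text_only(SAMPLE)   # roughly what would be searchable
print(indexed_text)
```

Values that live only in attributes (currency="usd", id="drama") do not
appear in the searchable text, and the text is no longer tied to the
element it came from, so a query cannot target media_format specifically.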

I've done some work on an XmlStripCharFilter, which does the same thing
(only for well-formed XML) using the WSTX XML parser, which provides a
little bit of extra XML goodness (like entity resolution and XInclude
processing) that HTMLStripCharFilter doesn't. I could share it if there's
interest.

-Mike

On 05/18/2011 05:27 PM, Judioo wrote:
> Great document. I can see how to import the data direct from the database.
> However it seems as though I need to write xpath's in the config to extract
> the fields that I wish to transform into an solr document.
>
> So it seems that there is no way of storing the document structure in solr
> as is?
>
>
> 2011/5/18 Yury Kats <yu...@yahoo.com>
>
>> On 5/18/2011 4:19 PM, Judioo wrote:
>>
>> > Any help is greatly appreciated. Pointers to documentation that address my
>> > issues is even more helpful.
>>
>> I think this would be a good start:
>>
>> http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
>>
>

Re: Storing, indexing and searching XML documents in Solr

Posted by Judioo <co...@judioo.com>.
Great document. I can see how to import the data directly from the database.
However, it seems as though I need to write XPath expressions in the config
to extract the fields that I wish to turn into a Solr document.

So it seems that there is no way of storing the document structure in Solr
as is?


2011/5/18 Yury Kats <yu...@yahoo.com>

> On 5/18/2011 4:19 PM, Judioo wrote:
>
> > Any help is greatly appreciated. Pointers to documentation that address my
> > issues is even more helpful.
>
> I think this would be a good start:
>
> http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
>

Re: Storing, indexing and searching XML documents in Solr

Posted by Yury Kats <yu...@yahoo.com>.
On 5/18/2011 4:19 PM, Judioo wrote:

> Any help is greatly appreciated. Pointers to documentation that address my
> issues is even more helpful.

I think this would be a good start:
http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource