Posted to solr-user@lucene.apache.org by Lucas Miguez <lu...@gmail.com> on 2011/07/13 19:00:19 UTC

Preserve XML hierarchy

Hi,

Is it possible to do that in Apache Solr? If I run a search, how do I
know where the result comes from?

Thanks!

I have an XML like this:
<aaaa>
<bbbb>
<cccc>some Text</cccc>
<dddd>another text</dddd>
<eeee>
<ffff>text</ffff>
<gggg>more text</gggg>
<hhhh>
<iiii>
<jjjj>text</jjjj>
</iiii>
</hhhh>
</eeee>
</bbbb>
<bbbb>
<cccc>some Text</cccc>
<eeee>
<ffff>text</ffff>
</eeee>
</bbbb>
<bbbb>
<cccc>some Text</cccc>
<dddd>another text</dddd>
<eeee>
<ffff>text</ffff>
<gggg>more text</gggg>
<hhhh>
<iiii>
<jjjj>text</jjjj>
</iiii>
<iiii>
<jjjj>text</jjjj>
<hhhh>
<iiii>
<jjjj>text</jjjj>
</iiii>
</hhhh>
</iiii>
</hhhh>
</eeee>
</bbbb>
</aaaa>

Re: Preserve XML hierarchy

Posted by Erick Erickson <er...@gmail.com>.

Jars aren't where it's at. You apply patches to *source* code,
then compile.

Here's a good place to start understanding this process:

http://wiki.apache.org/solr/HowToContribute

See "getting the code" and "working with patches"

I *strongly* advise you to get the code and compile
it and run it first before applying the patch, just to
eliminate an extra variable...

Best
Erick

On Thu, Aug 25, 2011 at 3:56 AM, _snake_ <lu...@gmail.com> wrote:
> Hi Michael, Thanks for your help!
>
> I am using Apache Solr 3.2 on windows.
>
> I am trying to apply the 2 patches (
> https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#issue-tabs
> XMLCharFilter Patch ), but I have no idea to do that.
>
> What do I need to open the Solr project? Or Which is the .jar file that I
> need to open?
> I saw that there is 2 files  (woodstox and stax2), what I have to do with
> that files?
>
> Thanks!
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Preserve-XML-hierarchy-tp3166690p3283275.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Preserve XML hierarchy

Posted by _snake_ <lu...@gmail.com>.

Hi Michael, Thanks for your help!

I am using Apache Solr 3.2 on Windows.

I am trying to apply the 2 patches (
https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#issue-tabs
XMLCharFilter Patch ), but I have no idea how to do that.

What do I need to open the Solr project? Or which .jar file do I
need to open?
I saw that there are 2 files (woodstox and stax2); what do I have to do with
those files?

Thanks!

Re: Preserve XML hierarchy

Posted by Michael Sokolov <so...@ifactory.com>.

Here's an idea: if you index the full text of your XML document using 
XmlCharFilter - available as a patch (or HtmlCharFilter), and then 
highlight the entire document (you will need to fiddle with highlighter 
parameters a bit to make sure you get 1 fragment that covers the entire 
file) with some tag like <match>, then you can take the highlighted 
result, parse it as an XML document into a tree model like JDOM or DOM, 
and execute XPath like: name(/descendant::match[1]/..) to find out the 
context in which your (first) hit appears.
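
A minimal sketch of the last step, in Python: it assumes the highlighter has already returned the whole document as well-formed XML with hits wrapped in <match> (the tag name comes from the suggestion above; the function name and the sample string are only illustrative). It reports the chain of ancestors of the first hit rather than just the parent name.

```python
import xml.etree.ElementTree as ET


def first_hit_context(highlighted_xml, tag="match"):
    """Return the element path that encloses the first <match> hit, or None."""
    root = ET.fromstring(highlighted_xml)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    # iter() walks in document order, so this is the first hit in the file.
    hit = next(root.iter(tag), None)
    if hit is None:
        return None
    path = []
    node = parents.get(hit)
    while node is not None:
        path.append(node.tag)
        node = parents.get(node)
    return "/" + "/".join(reversed(path))


sample = (
    "<Part><TitlePart>7.</TitlePart><gSubPart><SubPart>"
    "<TextSubPart>propios de la <match>zona</match> y</TextSubPart>"
    "</SubPart></gSubPart></Part>"
)
print(first_hit_context(sample))  # /Part/gSubPart/SubPart/TextSubPart
```

The path contains no positional indexes, so it tells you the kind of node the hit is in, not which sibling.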

-Mike

On 7/26/2011 10:48 AM, Lucas Miguez wrote:
> Hi, finally now I have all the field names of each document using the
> Luke Request Handler (http://wiki.apache.org/solr/LukeRequestHandler)
> and making HTTP Request to Solr I can get all the fields that contain
> the word that I am searching.
> I'll keep looking for a better solution.
>
> Thanks!
>
> Regards
>
> 2011/7/15 Gora Mohanty
>> On Thu, Jul 14, 2011 at 8:43 PM, Lucas Miguez<lu...@gmail.com>  wrote:
>>> Thanks for your help!
>>>
>>> DIH XPathEntityProcessor helps me to index the XML Files, but, does it
>>> help to me to know from where the node comes? Following the example in
>>> my previous post:
>>>
>>>>> example: Imagine that the user search the word "zona", then I have to
>>>>> show the TitleP, the TextP, the TitlePart, the TextPart and all the
>>>>> TextSubPart that are childs of gSubPart.
>>> Well, I tried to create TextPart, TitlePart, etc with the XPath
>>> expression of the location in the original XML, using dynamic fields,
>>> for example:
>>> <dynamic field="TextPart *" multivalued="true" indexed="true" ... />
>> There should not be a space between "TextPart" and "*"
>>
>>> to have the XPath associated with the field, but I don't know how to
>>> search in all "TextPart *" fields...
>> [...]
>>
>> You can search in individual fields, e.g., with ?q=TitlePart:myterm.
>> For searching in all "TextPart*" fields, the easiest way probably is
>> to copy the fields into a full-text search field. With the default Solr
>> schema, this can be done by adding a directive like
>>    <copyField source="TextPart*"  dest="text" />
>> This copies all fields into the field "text", which is searched by
>> default. Thus, ?q=myterm will find "myterm" in all "TextPart*"
>> fields.
>>
>> Regards,
>> Gora
>>

Re: Preserve XML hierarchy

Posted by Lucas Miguez <lu...@gmail.com>.

Hi, I can now get all the field names of each document using the
Luke Request Handler (http://wiki.apache.org/solr/LukeRequestHandler),
and by making HTTP requests to Solr I can find all the fields that contain
the word that I am searching for.
I'll keep looking for a better solution.
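
A rough sketch of that workaround in Python, for anyone following along. It only builds the per-field query URLs; the base URL, the "fields" key of the parsed Luke response, and the "Text" prefix filter are assumptions for illustration, not something confirmed in this thread.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def per_field_queries(luke_response, term, prefix="Text"):
    """Build one 'field:term' query URL per field listed by Luke.

    luke_response is assumed to be the parsed JSON of a Luke request,
    with field names as the keys of its "fields" entry.
    """
    names = sorted(luke_response.get("fields", {}))
    return {
        name: SOLR_SELECT + "?" + urlencode({"q": f"{name}:{term}", "rows": 0})
        for name in names
        if name.startswith(prefix)
    }


luke = {"fields": {"TextPart": {}, "TextSubPart": {}, "id": {}}}
for field, url in per_field_queries(luke, "zona").items():
    print(field, url)
```

Each URL asks for zero rows, so only the hit count in the response is needed to decide whether that field contains the term.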

Thanks!

Regards

2011/7/15 Gora Mohanty
> On Thu, Jul 14, 2011 at 8:43 PM, Lucas Miguez <lu...@gmail.com> wrote:
>> Thanks for your help!
>>
>> DIH XPathEntityProcessor helps me to index the XML Files, but, does it
>> help to me to know from where the node comes? Following the example in
>> my previous post:
>>
>>>> example: Imagine that the user search the word "zona", then I have to
>>>> show the TitleP, the TextP, the TitlePart, the TextPart and all the
>>>> TextSubPart that are childs of gSubPart.
>>
>> Well, I tried to create TextPart, TitlePart, etc with the XPath
>> expression of the location in the original XML, using dynamic fields,
>> for example:
>> <dynamic field="TextPart *" multivalued="true" indexed="true" ... />
>
> There should not be a space between "TextPart" and "*"
>
>> to have the XPath associated with the field, but I don't know how to
>> search in all "TextPart *" fields...
> [...]
>
> You can search in individual fields, e.g., with ?q=TitlePart:myterm.
> For searching in all "TextPart*" fields, the easiest way probably is
> to copy the fields into a full-text search field. With the default Solr
> schema, this can be done by adding a directive like
>   <copyField source="TextPart*"  dest="text" />
> This copies all fields into the field "text", which is searched by
> default. Thus, ?q=myterm will find "myterm" in all "TextPart*"
> fields.
>
> Regards,
> Gora
>

Re: Preserve XML hierarchy

Posted by Gora Mohanty <go...@mimirtech.com>.

On Thu, Jul 14, 2011 at 8:43 PM, Lucas Miguez <lu...@gmail.com> wrote:
> Thanks for your help!
>
> DIH XPathEntityProcessor helps me to index the XML Files, but, does it
> help to me to know from where the node comes? Following the example in
> my previous post:
>
>>> example: Imagine that the user search the word "zona", then I have to
>>> show the TitleP, the TextP, the TitlePart, the TextPart and all the
>>> TextSubPart that are childs of gSubPart.
>
> Well, I tried to create TextPart, TitlePart, etc with the XPath
> expression of the location in the original XML, using dynamic fields,
> for example:
> <dynamic field="TextPart *" multivalued="true" indexed="true" ... />

There should not be a space between "TextPart" and "*"

> to have the XPath associated with the field, but I don't know how to
> search in all "TextPart *" fields...
[...]

You can search in individual fields, e.g., with ?q=TitlePart:myterm.
For searching in all "TextPart*" fields, the easiest way probably is
to copy the fields into a full-text search field. With the default Solr
schema, this can be done by adding a directive like
   <copyField source="TextPart*"  dest="text" />
This copies all fields matching "TextPart*" into the field "text",
which is searched by default. Thus, ?q=myterm will find "myterm" in
all "TextPart*" fields.
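
To make the effect of the copyField directive concrete, here is a small Python simulation. It is not Solr code: fnmatch is only an approximation of Solr's glob matching, and the field names with an xpath suffix (TextPart_2_3 and so on) are hypothetical.

```python
from fnmatch import fnmatchcase


def apply_copy_field(doc, source_glob="TextPart*", dest="text"):
    """Mimic <copyField source=... dest=...> on a dict of field -> value."""
    out = dict(doc)
    copied = [value for name, value in doc.items() if fnmatchcase(name, source_glob)]
    out[dest] = list(doc.get(dest, [])) + copied
    return out


doc = {
    "id": "1",
    "TitlePart_2_3": "7. Ademas, la actividad",   # not matched by TextPart*
    "TextPart_2_3": "zona y medio",               # matched
    "TextSubPart_2_3_1": "Gestion de los medios"  # not matched either
}
print(apply_copy_field(doc)["text"])  # ['zona y medio']
```

Note that "TextPart*" does not pick up TitlePart or TextSubPart fields; each of those would need its own copyField line if they should be searchable through "text" as well.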

Regards,
Gora

Fwd: Preserve XML hierarchy

Posted by Lucas Miguez <lu...@gmail.com>.

Thanks for your help!

DIH XPathEntityProcessor helps me to index the XML files, but does it
help me to know which node a result comes from? Following the example in
my previous post:

>> example: Imagine that the user search the word "zona", then I have to
>> show the TitleP, the TextP, the TitlePart, the TextPart and all the
>> TextSubPart that are childs of gSubPart.

Well, I tried to create TextPart, TitlePart, etc. together with the XPath
expression of their location in the original XML, using dynamic fields,
for example:
<dynamic field="TextPart *" multivalued="true" indexed="true" ... />

to have the XPath associated with the field, but I don't know how to
search in all "TextPart *" fields...

Maybe I need to define my own FunctionQuery.

I'll keep looking for a solution.

2011/7/14 Michael Sokolov
> Have a look at
> http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
>
> It might be just what you need?
>
> -Mike
>
> On 7/14/2011 3:31 AM, Lucas Miguez wrote:
>>
>> Hi,
>>
>> yes, I was asking about it, is it possible to index an XML file?
>>
>> Is it possible to know which node of the XML the search result comes from?
>>
>> So I have 2 XML files, the original and the summary. I want to index
>> the summary. So, that is an example of the summary XML:
>>
>> <Objectives>
>>       <Activity xpath="2_3">
>>         <TitleP>A.</TitleP>
>>         <TextP>Requisitos generales mínimos exigibles a las
>> explotaciones para las que se soliciten
>> las ayudas.</TextP>
>>         <Part>
>>           <TitlePart>7. Además, la actividad de la explotación deberá
>> garantizar:
>> </TitlePart>
>>           <gSubPart>
>>             <SubPart>
>>               <TextSubPart>a) Gestión de los medios de
>> producción.</TextSubPart>
>>             </SubPart>
>>             <SubPart>
>>               <TextSubPart>b) Conservación de elementos propios de la
>> zona y en consonancia con el medio.</TextSubPart>
>>             </SubPart>
>>           </gSubPart>
>>         </Part>
>>       </Activity>
>> </Objectives>
>>
>> That is an summary of my original XML file. So, the xpath atribute in
>> Activity Element shows me the way to retrieve the information in the
>> original file (2_3 : 2 is the second element in PartV, and 3 is the
>> third Part inside the second PartV).
>> So, I need to index the fields (TitleP, TextP, TitlePart, TextPart,
>> TextSubPart. This fields can occur zero or more times), and I want to
>> know the xpath to the original file for each field, because I need to
>> show to the user the hierarchy of the results. Following the XML
>> example: Imagine that the user search the word "zona", then I have to
>> show the TitleP, the TextP, the TitlePart, the TextPart and all the
>> TextSubPart that are childs of gSubPart.
>>
>> Is there any example similar to my issue?
>>
>> Thanks!
>>
>>
>>
>> 2011/7/13 Gora Mohanty<go...@mimirtech.com>:
>>>
>>> On Wed, Jul 13, 2011 at 10:30 PM, Lucas Miguez<lu...@gmail.com>
>>>  wrote:
>>>>
>>>> Hi,
>>>>
>>>> is it possible to do that in Apache Solr? If i make a search, how I
>>>> know from where it comes the result?
>>>
>>> [...]
>>>
>>> Your question is not very clear, and I happen unfortunately to be
>>> out of crystal balls and Tarot cards.
>>>
>>> Is it possible to do what? Make a search on what, and what sort
>>> of results do you expect from said search?
>>>
>>> Peering into the misty depths of my non-existent crystal ball,
>>> if you are asking is it possible to index an XML file, search it,
>>> and figure out which node of the XML the search result comes
>>> from, yes that is possible; though details, and better advice
>>> would require more input from your side. Roughly speaking,
>>> each node can go into a separate Solr field, and full-text
>>> search on all relevant fields is also possible. Joking aside, please
>>> do provide more details.
>>>
>>> Regards,
>>> Gora
>>>
>
>

Re: Preserve XML hierarchy

Posted by Michael Sokolov <so...@ifactory.com>.

Have a look at 
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor

It might be just what you need?

-Mike

On 7/14/2011 3:31 AM, Lucas Miguez wrote:
> Hi,
>
> yes, I was asking about it, is it possible to index an XML file?
>
> Is it possible to know which node of the XML the search result comes from?
>
> So I have 2 XML files, the original and the summary. I want to index
> the summary. So, that is an example of the summary XML:
>
> <Objectives>
>        <Activity xpath="2_3">
>          <TitleP>A.</TitleP>
>          <TextP>Requisitos generales mínimos exigibles a las
> explotaciones para las que se soliciten
> las ayudas.</TextP>
>          <Part>
>            <TitlePart>7. Además, la actividad de la explotación deberá
> garantizar:
> </TitlePart>
>            <gSubPart>
>              <SubPart>
>                <TextSubPart>a) Gestión de los medios de producción.</TextSubPart>
>              </SubPart>
>              <SubPart>
>                <TextSubPart>b) Conservación de elementos propios de la
> zona y en consonancia con el medio.</TextSubPart>
>              </SubPart>
>            </gSubPart>
>          </Part>
>        </Activity>
> </Objectives>
>
> That is an summary of my original XML file. So, the xpath atribute in
> Activity Element shows me the way to retrieve the information in the
> original file (2_3 : 2 is the second element in PartV, and 3 is the
> third Part inside the second PartV).
> So, I need to index the fields (TitleP, TextP, TitlePart, TextPart,
> TextSubPart. This fields can occur zero or more times), and I want to
> know the xpath to the original file for each field, because I need to
> show to the user the hierarchy of the results. Following the XML
> example: Imagine that the user search the word "zona", then I have to
> show the TitleP, the TextP, the TitlePart, the TextPart and all the
> TextSubPart that are childs of gSubPart.
>
> Is there any example similar to my issue?
>
> Thanks!
>
>
>
> 2011/7/13 Gora Mohanty<go...@mimirtech.com>:
>> On Wed, Jul 13, 2011 at 10:30 PM, Lucas Miguez<lu...@gmail.com>  wrote:
>>> Hi,
>>>
>>> is it possible to do that in Apache Solr? If i make a search, how I
>>> know from where it comes the result?
>> [...]
>>
>> Your question is not very clear, and I happen unfortunately to be
>> out of crystal balls and Tarot cards.
>>
>> Is it possible to do what? Make a search on what, and what sort
>> of results do you expect from said search?
>>
>> Peering into the misty depths of my non-existent crystal ball,
>> if you are asking is it possible to index an XML file, search it,
>> and figure out which node of the XML the search result comes
>> from, yes that is possible; though details, and better advice
>> would require more input from your side. Roughly speaking,
>> each node can go into a separate Solr field, and full-text
>> search on all relevant fields is also possible. Joking aside, please
>> do provide more details.
>>
>> Regards,
>> Gora
>>

Re: Preserve XML hierarchy

Posted by Walter Underwood <wu...@wunderwood.org>.

This will be much easier on an XML database, because that supports XPath natively.

For open source, try eXist.

For a commercial XML database, try MarkLogic (much, much faster than eXist).

wunder
Walter Underwood
Lead Engineer, MarkLogic
www.marklogic.com

On Jul 14, 2011, at 12:31 AM, Lucas Miguez wrote:

> Hi,
> 
> yes, I was asking about it, is it possible to index an XML file?
> 
> Is it possible to know which node of the XML the search result comes from?
> 
> So I have 2 XML files, the original and the summary. I want to index
> the summary. So, that is an example of the summary XML:
> 
> <Objectives>
>      <Activity xpath="2_3">
>        <TitleP>A. </TitleP>
>        <TextP>Requisitos generales mínimos exigibles a las
> explotaciones para las que se soliciten
> las ayudas.</TextP>
>        <Part>
>          <TitlePart>7. Además, la actividad de la explotación deberá
> garantizar:
> </TitlePart>
>          <gSubPart>
>            <SubPart>
>              <TextSubPart>a) Gestión de los medios de producción.</TextSubPart>
>            </SubPart>
>            <SubPart>
>              <TextSubPart>b) Conservación de elementos propios de la
> zona y en consonancia con el medio.</TextSubPart>
>            </SubPart>
>          </gSubPart>
>        </Part>
>      </Activity>
> </Objectives>
> 
> That is an summary of my original XML file. So, the xpath atribute in
> Activity Element shows me the way to retrieve the information in the
> original file (2_3 : 2 is the second element in PartV, and 3 is the
> third Part inside the second PartV).
> So, I need to index the fields (TitleP, TextP, TitlePart, TextPart,
> TextSubPart. This fields can occur zero or more times), and I want to
> know the xpath to the original file for each field, because I need to
> show to the user the hierarchy of the results. Following the XML
> example: Imagine that the user search the word "zona", then I have to
> show the TitleP, the TextP, the TitlePart, the TextPart and all the
> TextSubPart that are childs of gSubPart.
> 
> Is there any example similar to my issue?
> 
> Thanks!
> 
> 
> 
> 2011/7/13 Gora Mohanty <go...@mimirtech.com>:
>> On Wed, Jul 13, 2011 at 10:30 PM, Lucas Miguez <lu...@gmail.com> wrote:
>>> Hi,
>>> 
>>> is it possible to do that in Apache Solr? If i make a search, how I
>>> know from where it comes the result?
>> [...]
>> 
>> Your question is not very clear, and I happen unfortunately to be
>> out of crystal balls and Tarot cards.
>> 
>> Is it possible to do what? Make a search on what, and what sort
>> of results do you expect from said search?
>> 
>> Peering into the misty depths of my non-existent crystal ball,
>> if you are asking is it possible to index an XML file, search it,
>> and figure out which node of the XML the search result comes
>> from, yes that is possible; though details, and better advice
>> would require more input from your side. Roughly speaking,
>> each node can go into a separate Solr field, and full-text
>> search on all relevant fields is also possible. Joking aside, please
>> do provide more details.
>> 
>> Regards,
>> Gora
>>

Re: Preserve XML hierarchy

Posted by Lucas Miguez <lu...@gmail.com>.

Hi,

Yes, that is what I was asking: is it possible to index an XML file?

Is it possible to know which node of the XML the search result comes from?

I have 2 XML files, the original and the summary. I want to index
the summary. Here is an example of the summary XML:

<Objectives>
      <Activity xpath="2_3">
        <TitleP>A. </TitleP>
        <TextP>Requisitos generales mínimos exigibles a las
explotaciones para las que se soliciten
las ayudas.</TextP>
        <Part>
          <TitlePart>7. Además, la actividad de la explotación deberá
garantizar:
</TitlePart>
          <gSubPart>
            <SubPart>
              <TextSubPart>a) Gestión de los medios de producción.</TextSubPart>
            </SubPart>
            <SubPart>
              <TextSubPart>b) Conservación de elementos propios de la
zona y en consonancia con el medio.</TextSubPart>
            </SubPart>
          </gSubPart>
        </Part>
      </Activity>
</Objectives>

That is a summary of my original XML file. The xpath attribute on the
Activity element tells me how to retrieve the information in the
original file (2_3: 2 is the second PartV element, and 3 is the
third Part inside the second PartV).
I need to index the fields TitleP, TextP, TitlePart, TextPart and
TextSubPart (each of these fields can occur zero or more times), and I want to
know the XPath into the original file for each field, because I need to
show the user the hierarchy of the results. Following the XML
example: imagine that the user searches for the word "zona"; then I have to
show the TitleP, the TextP, the TitlePart, the TextPart and all the
TextSubPart elements that are children of gSubPart.
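
To make the "2_3" convention concrete, here is a small Python sketch of how such a value could be turned into a relative XPath. The element names PartV and Part come from the description above; the function name and the idea that each position maps to one nesting level are assumptions for illustration.

```python
def decode_xpath_attr(value, element_names=("PartV", "Part")):
    """Turn an xpath attribute such as "2_3" into "PartV[2]/Part[3]"."""
    indexes = [int(part) for part in value.split("_")]
    if len(indexes) > len(element_names):
        raise ValueError("more positions than known nesting levels: " + value)
    # Pair each 1-based position with the element name at that depth.
    return "/".join(f"{name}[{i}]" for name, i in zip(element_names, indexes))


print(decode_xpath_attr("2_3"))  # PartV[2]/Part[3]
```

The result is relative to wherever the PartV elements live in the original file, which is not shown in this message.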

Is there any example similar to my issue?

Thanks!

2011/7/13 Gora Mohanty <go...@mimirtech.com>:
> On Wed, Jul 13, 2011 at 10:30 PM, Lucas Miguez <lu...@gmail.com> wrote:
>> Hi,
>>
>> is it possible to do that in Apache Solr? If i make a search, how I
>> know from where it comes the result?
> [...]
>
> Your question is not very clear, and I happen unfortunately to be
> out of crystal balls and Tarot cards.
>
> Is it possible to do what? Make a search on what, and what sort
> of results do you expect from said search?
>
> Peering into the misty depths of my non-existent crystal ball,
> if you are asking is it possible to index an XML file, search it,
> and figure out which node of the XML the search result comes
> from, yes that is possible; though details, and better advice
> would require more input from your side. Roughly speaking,
> each node can go into a separate Solr field, and full-text
> search on all relevant fields is also possible. Joking aside, please
> do provide more details.
>
> Regards,
> Gora
>

Re: Preserve XML hierarchy

Posted by Gora Mohanty <go...@mimirtech.com>.

On Wed, Jul 13, 2011 at 10:30 PM, Lucas Miguez <lu...@gmail.com> wrote:
> Hi,
>
> is it possible to do that in Apache Solr? If i make a search, how I
> know from where it comes the result?
[...]

Your question is not very clear, and I happen unfortunately to be
out of crystal balls and Tarot cards.

Is it possible to do what? Make a search on what, and what sort
of results do you expect from said search?

Peering into the misty depths of my non-existent crystal ball,
if you are asking is it possible to index an XML file, search it,
and figure out which node of the XML the search result comes
from, yes that is possible; though details, and better advice
would require more input from your side. Roughly speaking,
each node can go into a separate Solr field, and full-text
search on all relevant fields is also possible. Joking aside, please
do provide more details.
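
As a rough illustration of "each node can go into a separate Solr field", the Python sketch below flattens an XML document into (positional path, field name, text) rows that could then be sent to Solr. The function name and the row layout are assumptions, not a Solr API, and text that follows a child element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return (positional_path, tag, text) for every element that has text."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node, path):
        text = " ".join((node.text or "").split())  # collapse whitespace
        if text:
            rows.append((path, node.tag, text))
        counts = {}
        for child in node:
            # 1-based position among siblings with the same tag, as in XPath.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return rows


sample = (
    "<Part><TitlePart>7.</TitlePart><gSubPart>"
    "<SubPart><TextSubPart>a</TextSubPart></SubPart>"
    "<SubPart><TextSubPart>zona</TextSubPart></SubPart>"
    "</gSubPart></Part>"
)
for row in flatten(sample):
    print(row)
```

Storing the positional path next to each value is one way to recover, at search time, which node a hit came from.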

Regards,
Gora