Posted to solr-user@lucene.apache.org by Abhiroop <as...@gmail.com> on 2013/08/16 14:55:48 UTC

Indexing an XML file in Apache Solr

I am very new to Solr. I am looking to index an XML file and search its
contents. Its structure looks something like this:

<entry id="REACT_142474" acc="REACT_142474.5">
<name>((1,6)-alpha-glucosyl)poly((1,4)-alpha-glucosyl)glycogenin =&gt;
poly{(1,4)-alpha-      glucosyl} glycogenin + alpha-D-glucose</name>
<description>This event has been computationally inferred from an event that
has been demonstrated in another species.The inference is based on the
homology mapping in Ensembl Compara. Briefly, reactions for which all
involved PhysicalEntities (in input, output and catalyst) have a mapped
orthologue/paralogue (for complexes at least 75% of components must have a
mapping) are inferred to the other species. High level events are also
inferred for these events to allow for easier navigation.More details and
caveats of the event inference in Reactome. For details on the Ensembl
Compara system see also: Gene orthology/paralogy prediction
method.</description>
<dates>
<date type="creation" value="06-JUN-2013"/>
<date type="last_modification" value="06-JUN-2013"/>
</dates>
<cross_references>
<ref dbname="ChEBI" dbkey="17925"/>
<ref dbname="UniProt" dbkey="Q06625"/>
<ref dbname="ChEBI" dbkey="18291"/>
<ref dbname="UniProt" dbkey="P47011"/>
<ref dbname="UniProt" dbkey="P36143"/>
<ref dbname="GO" dbkey="GO:0004135"/>
<ref dbname="taxonomy" dbkey="4932"/>
</cross_references>
<additional_fields>
<field name="organism">Saccharomyces cerevisiae</field>
</additional_fields>
</entry>

Is it essential to use the DIH to import this data into Solr? Isn't there
any simpler way to accomplish the task? Can it be done through SolrJ? I am
fine with outputting the result through the console too. It would be really
helpful if someone could point me to some useful examples or resources on
this apart from the official documentation.



--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-an-XML-file-in-Apache-Solr-tp4085053.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing an XML file in Apache Solr

Posted by Chris Hostetter <ho...@fucit.org>.

: I am very new to Solr. I am looking to index an xml file and search its
: contents. Its structure resembles something like this
	...
: Is it essential to use the DIH to import this data into Solr? Isn't there
: any simpler way to accomplish the task? Can it be done through SolrJ as I am

Ignore for a minute that your data happens to be in an XML file....

 * you have some structured data
 * you want to put it in solr
 * you want to search it

In order to do these things, you have to understand (for yourself, but if 
you want help from others you have to be able to explain it to us as well) 
what that structure means, and how you want to be able to search it.

If you are familiar with relational databases, ask yourself: if I were 
putting my data into a table, what would my rows be? what would my 
columns be? what data types would I use for each column? what pieces of 
my data would I put into each column/row? 

You have to ask yourself the same types of questions when you use Solr to 
decide what you want your schema.xml to look like, and what you want to 
model as "documents" -- and depending on your answers, then you can decide 
how to index the data.
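
To make the modelling questions above concrete, here is a minimal Python
sketch of one possible answer for the entry in the original post: one Solr
document per <entry>, with attributes and child elements flattened into
fields. The field names (date_creation, xref_UniProt and so on) and the
shortened name/description text are assumptions made for illustration, not
something this thread prescribes; a real schema.xml would need matching
field definitions.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the entry from the original question.
SAMPLE = """<entry id="REACT_142474" acc="REACT_142474.5">
  <name>example reaction name</name>
  <description>Example description text.</description>
  <dates>
    <date type="creation" value="06-JUN-2013"/>
    <date type="last_modification" value="06-JUN-2013"/>
  </dates>
  <cross_references>
    <ref dbname="ChEBI" dbkey="17925"/>
    <ref dbname="UniProt" dbkey="Q06625"/>
    <ref dbname="taxonomy" dbkey="4932"/>
  </cross_references>
  <additional_fields>
    <field name="organism">Saccharomyces cerevisiae</field>
  </additional_fields>
</entry>"""


def entry_to_doc(entry):
    """Flatten one <entry> element into a dict of field name -> value(s)."""
    doc = {
        "id": entry.get("id"),
        "acc": entry.get("acc"),
        # Collapse line breaks and runs of spaces inside the text.
        "name": " ".join((entry.findtext("name") or "").split()),
        "description": " ".join((entry.findtext("description") or "").split()),
    }
    # One field per date type, e.g. date_creation (hypothetical naming).
    for date in entry.iter("date"):
        doc["date_" + date.get("type")] = date.get("value")
    # Cross references become multi-valued fields keyed by database name.
    for ref in entry.iter("ref"):
        doc.setdefault("xref_" + ref.get("dbname"), []).append(ref.get("dbkey"))
    # <field name="..."> children map directly to same-named fields.
    for field in entry.iter("field"):
        doc[field.get("name")] = field.text
    return doc


doc = entry_to_doc(ET.fromstring(SAMPLE))
print(doc)
```

Whether cross references should be separate fields per database, or a single
multi-valued field, depends on how you want to search them.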

Do you have to use DIH to index an XML file?  Not at all.  

You do have to use *something* to pull the pieces of data you want out of 
your XML file (or out of your CSV file, or out of your relational 
database, etc...) to model them as "Documents" containing "Fields" that 
can be put into Solr.  You might find DIH useful for that, or you might 
also find the ExtractingRequestHandler useful for that, or you might just 
want to implement your own bit of code that pulls what you want out of 
your XML files and sends it to Solr as SolrInputDocuments (using SolrJ), 
or you might want to write a bit of python/ruby/perl/lua/haskell code that 
does the same thing and sends it to Solr as xml or json using the format 
Solr expects for indexing commands.
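
As a rough illustration of that last option, the sketch below uses only the
Python standard library to turn already-extracted field values into Solr's
XML update message format (<add>, <doc>, <field name="...">). The field
names are hypothetical and would have to exist in your schema.xml. Actually
sending the message (an HTTP POST to the update handler, followed by a
commit) is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts as a Solr XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes several <field> elements with the same
            # name, which is how multi-valued fields are expressed.
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


message = to_solr_add_xml([
    {
        "id": "REACT_142474",
        "organism": "Saccharomyces cerevisiae",
        "xref_UniProt": ["Q06625", "P47011"],  # hypothetical field name
    },
])
print(message)
```

ElementTree escapes characters such as & and < in field values, so text like
the "=>" in the reaction name does not need special handling.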

that's entirely up to you.


-Hoss

Re: Indexing an XML file in Apache Solr

Posted by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in>.

DIH is not at all necessary and yes, SolrJ can be used to add data, though
I am not too sure about the XML part.

Try:
http://wiki.apache.org/solr/UpdateXmlMessages
and
http://wiki.apache.org/solr/Solrj




Re: Indexing an XML file in Apache Solr

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

Abhiroop, I'm cc-ing the lux mailing list since this thread might not be 
of interest to all of solr-user; I'd suggest following up on that list.

But to answer your actual question: see the documentation here 
http://luxdb.org/REST-API.html#LuxUpdateProcessor where it explains what 
to do.  Basically you just insert documents with two fields: lux_xml 
(the full text of the document, serialized as XML) and lux_uri (a 
pathname uniquely identifying the document).  You can add other fields 
if you want, but those are the special names (can be aliased if needed) 
that trigger Lux's update processor.
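
A minimal sketch of such a document, assuming it is sent using Solr's JSON
update syntax ({"add": {"doc": {...}}}). Only the field names lux_uri and
lux_xml come from the description above; the URI value and the shortened
XML are made-up examples.

```python
import json


def lux_add_command(uri, xml_text):
    """Build a Solr JSON "add" command carrying the two Lux fields."""
    return json.dumps({"add": {"doc": {"lux_uri": uri, "lux_xml": xml_text}}})


body = lux_add_command(
    "/reactome/REACT_142474.xml",  # hypothetical URI; any unique pathname
    '<entry id="REACT_142474" acc="REACT_142474.5"><name>example</name></entry>',
)
print(body)
```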

-Mike

PS I think we need a better "getting started" tutorial; lots of folks 
are confused about how to insert docs and get started. Putting it on the 
TODO list ...

On 08/19/2013 03:24 AM, Abhiroop wrote:
> Funnily enough, just today I was looking at Lux for searching through my XML
> file. What I have inferred is that I need to format my XML to fit the
> format of Solr. Do I have to code that manually, or is there some kind of
> parser that converts the XML fed to it into the Solr format? I couldn't
> find any code examples in Lux.

Re: Indexing an XML file in Apache Solr

Posted by Abhiroop <as...@gmail.com>.

Funnily enough, just today I was looking at Lux for searching through my XML
file. What I have inferred is that I need to format my XML to fit the
format of Solr. Do I have to code that manually, or is there some kind of
parser that converts the XML fed to it into the Solr format? I couldn't
find any code examples in Lux.


On Sun, Aug 18, 2013 at 11:20 PM, Michael Sokolov-3 [via Lucene] <
ml-node+s472066n4085344h78@n3.nabble.com> wrote:

> You might be interested in trying Lux, which is a Solr extension that
> indexes XML documents using the element and attribute names and the
> contents of those nodes in your document.  It also allows you to define
> XPath indexes (like DIH, I think, but with the full XPath 2.0 syntax),
> and to query your document collection using XQuery 1.0 (in combination
> with standard lucene searches at the document level).  See
> http://luxdb.org/
>
> -Mike Sokolov


-- 
Kloona - Coming Soon!





Re: Indexing an XML file in Apache Solr

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

You might be interested in trying Lux, which is a Solr extension that 
indexes XML documents using the element and attribute names and the 
contents of those nodes in your document.  It also allows you to define 
XPath indexes (like DIH, I think, but with the full XPath 2.0 syntax), 
and to query your document collection using XQuery 1.0 (in combination 
with standard lucene searches at the document level).  See http://luxdb.org/

-Mike Sokolov
