Posted to solr-user@lucene.apache.org by Dwarak R <dw...@solutionnet.net> on 2007/11/14 05:16:48 UTC

solr - other document formats

Hey All

I read an article on http://www.xml.com/lpt/a/1668

It states that:

"As we've seen, the XML format used by Solr for indexing is quite simple. Extracting the relevant metadata to create these XML documents from the many formats floating around, however, is another story. Fortunately, Lucene users have the same problem and have been working on it for quite a while; the Lucene FAQ lists a number of references to parsers and filters which can be used to extract content and metadata from many common document formats. 
Solr won't index spreadsheets or other formats out of the box, but that is not its role: you should see Solr as the "search engine" component of a broader "search system," where extraction of content and metadata is handled by other components. This will help to keep your search system maintainable and testable, and it helps the Solr team focus on doing one thing well."
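
As an illustration of the split the article describes, here is a minimal Python sketch of the "other component" side. It assumes some extractor has already pulled plain-text fields out of a PDF, Word, or Excel file, and only shows wrapping those fields in Solr's XML add message. The helper name build_add_xml and the field names (id, title, text) are assumptions for the example; real field names must match your schema.xml.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Wrap already-extracted fields in a Solr <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical output of a PDF/Word/Excel extractor.
extracted = {"id": "doc-1", "title": "Q3 report", "text": "Revenue & costs"}
print(build_add_xml(extracted))
```

The extraction step itself is deliberately left out: it depends on the parser chosen for each format, which is exactly the part the article says lives outside Solr.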

Will parsing documents such as PDF, MS Word, and Excel files into XML be done by another component?

Could somebody please advise?

Regards

Dwarak R

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender&postmaster@solutonnet.net  immediately and delete the original. Any other use of the email by you is prohibited.

RE: solr - other document formats

Posted by "SDIS M. Beauchamp" <FB...@SDIS71.fr>.

The commit can't be false. It is either done or not. If it is not, your users won't be able to search the uncommitted documents. If it is done, users can search all documents successfully sent to Solr.

You can use the autocommit feature (in solrconfig.xml) to avoid calling commit explicitly: you just have to send documents to Solr.
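
To make the explicit variant concrete, here is a minimal Python sketch of building the commit message as an HTTP POST to Solr's XML update handler. The URL below is an assumption (a local example install) and the helper names make_update_request and send are hypothetical; nothing is sent unless send() is called. With autocommit configured in solrconfig.xml, this explicit step would not be needed.

```python
import urllib.request

# Assumed update URL of a local example Solr install; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def make_update_request(xml, url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying a Solr XML update message."""
    return urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def send(request):
    """Send the request to Solr and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Documents posted in an <add> message are not searchable until a commit.
commit_request = make_update_request("<commit/>")
print(commit_request.get_method(), commit_request.full_url)
# send(commit_request)  # uncomment when a Solr server is running at the URL above
```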

Florent BEAUCHAMP

-----Original Message-----
From: Dwarak R [mailto:dwarak@solutionnet.net]
Sent: Wednesday, 14 November 2007 13:38
To: solr-user@lucene.apache.org
Subject: Re: solr - other document formats

Many thanks Florent

Hey All

My docs are parsed and the indexes are updated (using the UpdateRichDocuments patch). But tell me one thing: what will happen if I don't commit? If commit is false, where are the docs stored?

Regards

Dwarak R
----- Original Message -----
From: "SDIS M. Beauchamp" <FB...@SDIS71.fr>
To: <so...@lucene.apache.org>
Sent: Wednesday, November 14, 2007 1:13 PM
Subject: RE: solr - other document formats


You should take a look at 
http://wiki.apache.org/solr/UpdateRichDocuments?highlight=%28richdocument%29

It gives you a starting point to make the extractor you need

Regards

Florent


Re: solr - other document formats

Posted by Dwarak R <dw...@solutionnet.net>.

Many thanks Florent

Hey All

My docs are parsed and the indexes are updated (using the UpdateRichDocuments
patch). But tell me one thing: what will happen if I don't commit? If commit
is false, where are the docs stored?

Regards

Dwarak R

RE: solr - other document formats

Posted by "SDIS M. Beauchamp" <FB...@SDIS71.fr>.

You should take a look at http://wiki.apache.org/solr/UpdateRichDocuments?highlight=%28richdocument%29

It gives you a starting point to make the extractor you need 

Regards

Florent
