Posted to solr-user@lucene.apache.org by John Bartak <ba...@gmail.com> on 2011/12/05 22:02:58 UTC

Custom content extractor for Solr Cell

Is it possible to extract content for file types that Tika doesn’t support
without changing and rebuilding Tika?  Do I need to specify a tika.config
file in the solrconfig.xml file, and if so, what is the format of that file?



One example that I’m trying to solve is for a document management system
where the files are compressed – so I’d like to have a content extractor
that first decompresses the file and then delegates to the standard Solr
content extraction mechanism.   Perhaps writing a custom extractor is more
trouble than it is worth for this use case and I should just decompress the
data before sending it to Solr?

Re: Custom content extractor for Solr Cell

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi John,

See the discussion about indexing the contents of ZIP files: https://issues.apache.org/jira/browse/SOLR-2416

Depending on your use case, you may be able to write a Tika parser that handles your specific case, for example one that uncompresses a GZIP file and then runs Tika's AutoDetectParser (or similar) on its contents.

If you want to override how Tika parses certain MIME types, specify -Dtika.config=<path-to-your-tika-config> when starting Solr (3.5 or later), and it will obey your config. See Tika's web page for how to write your own parsers.
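To make the "decompress, then delegate" idea concrete, here is a minimal sketch that uses only the JDK. It is not a Tika parser and does not use Tika's API. The ContentExtractor interface and the DecompressingExtractor class are hypothetical names invented for this illustration, and the delegate stands in for whatever does the real extraction (a Tika parser, or a client that posts the decompressed bytes to Solr). It also assumes the files are GZIP-compressed, which may not match your document management system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class DecompressingExtractor {

    /** Hypothetical stand-in for the downstream extractor (e.g. a Tika parser). */
    public interface ContentExtractor {
        String extract(InputStream in) throws IOException;
    }

    private final ContentExtractor delegate;

    public DecompressingExtractor(ContentExtractor delegate) {
        this.delegate = delegate;
    }

    /** Unwraps the GZIP layer, then hands the decompressed stream to the delegate. */
    public String extract(InputStream compressed) throws IOException {
        try (InputStream plain = new GZIPInputStream(compressed)) {
            return delegate.extract(plain);
        }
    }

    /** Demo helper: GZIP-compresses a string so there is something to extract. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Demo delegate: reads the whole stream as UTF-8 text. */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        DecompressingExtractor extractor =
                new DecompressingExtractor(DecompressingExtractor::readAll);
        byte[] payload = gzip("hello solr");
        // Prints the decompressed text returned by the delegate.
        System.out.println(extractor.extract(new ByteArrayInputStream(payload)));
    }
}
```

The same wrapper shape applies whether the decompression happens inside a custom parser or in client code before the document is sent to Solr; only the delegate changes.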

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
