Posted to user@nutch.apache.org by ma...@Automationdirect.com on 2012/08/09 23:40:50 UTC

SolrIndex command

Hi there,
I am a new Nutch user. I am using Nutch to crawl and then send the crawl data
to Solr. I have a question about the bin/nutch solrindex command: which Tika
libraries are used for indexing? Are they the Tika libraries bundled with
Nutch, or does Nutch hand indexing over to Solr so that Solr's Tika libraries
are used? I think I read somewhere that Nutch focuses on crawling and parsing
and leaves the indexing to Solr, in which case Solr's libraries should be the
ones used.

Specifically, I am having problems extracting tags, e.g. <H1>, from
PDF files using the Nutch/Solr combination. My understanding is that the
extract-contrib module defined in schema.xml should be used.

Thanks in advance,
Madhvi
