Posted to user@nutch.apache.org by Emmanuel JOKE <jo...@gmail.com> on 2007/07/04 14:59:11 UTC

Code Newbie questions

Hi Guys,

I have a few questions:
1- I found the "lib-lucene-analyzers" library in the plugin folder.
How does it work? Should I just add "lib-lucene-analyzers" to the list of
plugins in nutch-site.xml, or should I also add language-identifier and
analysis-(fr|de|en)?

2- How do we know the name of the plugin we have to add in nutch-site.xml?
I just added analysis-fr to the list and got an exception saying that it
could not find org.apache.lucene.analyzer.FrenchAnalyzer. It was looking
for a Lucene implementation of the plugin instead of the Nutch
implementation, and I don't know why.
Is there any mapping between the plugin name and a class?

3- I tried to implement an HtmlParseFilter, but there are a few things that
I don't understand.
What is the purpose of a ParseResult? I don't understand why we would
store several parse results. Is there a specific use case?
Why is HtmlParseFilter.filter called only after a first ParseResult has
already been created?
How should I proceed if I want to remove some tags, together with their
content, from the HTML page? Should I parse the page again and create
another ParseResult, which is the only one I will use? For instance, there
is some content I don't want to index: I want to remove the content of
every select box in my HTML page. I thought I could do it in an
HtmlParseFilter, but I noticed that this wastes processing time: the page
is parsed once to create a first ParseResult (which I will never use), and
then parsed again in my HtmlParseFilter to get the text content that I
actually need to index.
I may have missed something; if so, I would appreciate your help.
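
To make the goal concrete, here is a minimal sketch of the pruning step
only, assuming the filter is handed the page as a standard org.w3c.dom
tree. The class and method names (SelectStripper, removeElements, parse)
are made up for illustration and are not part of Nutch. How the pruned tree
would feed back into the text that gets indexed is exactly the open
question above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: it only uses the JDK's DOM API.
public class SelectStripper {

    /** Removes every element named tagName (case-insensitive) below root, children included. */
    public static void removeElements(Node root, String tagName) {
        // Collect first, then remove: the NodeList returned by getChildNodes() is live,
        // so removing nodes while iterating over it would skip siblings.
        List<Node> doomed = new ArrayList<Node>();
        collect(root, tagName, doomed);
        for (Node n : doomed) {
            n.getParentNode().removeChild(n);
        }
    }

    private static void collect(Node node, String tagName, List<Node> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && child.getNodeName().equalsIgnoreCase(tagName)) {
                out.add(child); // no need to descend: the whole subtree is dropped
            } else {
                collect(child, tagName, out);
            }
        }
    }

    /** Test helper: builds a DOM from well-formed markup (a stand-in for the real HTML parser). */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<body><p>Keep me</p><select><option>Drop me</option></select></body>");
        removeElements(doc, "select");
        // Text extraction after pruning no longer sees the option labels.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If the text were extracted from the tree after a step like this, the select
box content would never reach the index, but that depends on the order in
which the parser and the filters run.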

Cheers
E

recrawl working in v0.71 how to for v0.9?

Posted by John Reidy <jo...@reidy.com>.

Hi, this question has been asked by other posters to this list, but I
haven't seen an answer yet. Hopefully someone can help.

I have recrawling working for v0.71, but using the v0.8 wiki scripts
I can't get it working on v0.9.
They appear to do a recrawl, but no new documents appear in the index.

I have been able to merge 2 separate indexes into one. However, with an
index of 500,000 documents, I am concerned about how efficient merging
will be if, on a daily basis, I want to add 100 or so new documents and
reindex 300.
The source material is from a document management system accessed by
URLs, and I will know exactly which documents are new and which have been
updated and require reindexing.

Do the scripts work, so that I just need to check again how I am using
them, or do I need to look at something else?

Regards
John Reidy.