Posted to user@nutch.apache.org by Phillip Rhodes <sp...@rhoderunner.com> on 2006/12/21 21:44:43 UTC

convert bin/nutch to use ant?

I move between XP/Mac/Sun/Linux depending on the client or where I am (work 
vs. home) and have found ant to be a good cross-platform scripting language. I 
went to run a nutch crawl on my XP box, and the script is not set up to 
run in an XP environment (yes, I could install cygwin).

I have started creating an ant file that I can use to invoke the 
different java programs that come with nutch, and found a "job" file. I 
have never come across a job file before. I am guessing it is a monster jar 
file with all the dependencies inside it. I don't think that will work 
in ant!

One little trick that I have learned is to put my ant build.xml file 
into my war file WEB-INF directory.  Since a war file represents a 
bundle of your dependencies, your ant file can easily be shipped with 
your war file and provide a nice and easy way to invoke your java main 
programs.

Has anyone created ant files to invoke the various nutch programs? Can 
I help out doing this? How would folks feel about a 2nd war file (an 
admin war app)? It would be a skeleton war file at first (providing no 
functionality), but we would put all the plugins inside the war file 
so that the ant file in the war file can find them and we can run them. 
Down the road, we can add a web ui to do a crawl, etc.

Thanks.

Re: NutchBean searching options

Posted by Dennis Kubes <nu...@dragonflymc.com>.

NutchBean creates a query through the [Query query = 
Query.parse(args[0], conf);] call in its main method.  The actual query 
object is created behind the scenes by the whole nutch analysis 
mechanism.  This does a lot of work that is helpful in creating general 
queries, but it is not the only way.

You can create your own query objects and pass them into the 
NutchBean.search(query, int) method.  Take a look at the book Lucene in 
Action by Erik Hatcher and Otis Gospodnetic, which shows how to create 
different types of query objects such as wildcard queries.

Dennis

Daniel López wrote:
> Hi,
>
> I have seen that NutchBean searches are case insensitive and use a 
> logical AND if several terms are used as criteria. Moreover, it uses 
> full words, so no partial matches are allowed (no wildcards such as 
> * or ?) and special characters (áéí...) have to be matched exactly.
>
> Is there any way one can tweak those settings and search through the 
> NutchBean with other settings? I'm more interested in allowing partial 
> matches and making criteria "special-characters insensitive".
>
> If NutchBean does not support that, is there any workaround, or should 
> I go straight to the code and use low-level Lucene to accomplish it?
>
> Thanks,
> D.

NutchBean searching options

Posted by Daniel López <D....@uib.es>.

Hi,

I have seen that NutchBean searches are case insensitive and use a 
logical AND if several terms are used as criteria. Moreover, it uses 
full words, so no partial matches are allowed (no wildcards such as 
* or ?) and special characters (áéí...) have to be matched exactly.

Is there any way one can tweak those settings and search through the 
NutchBean with other settings? I'm more interested in allowing partial 
matches and making criteria "special-characters insensitive".

If NutchBean does not support that, is there any workaround, or should I 
go straight to the code and use low-level Lucene to accomplish it?

Thanks,
D.

Intranet crawling maintenance

Posted by Daniel López <D....@uib.es>.

Hi there,

I think I have it more or less thought out, but just in case I missed 
something, I would like to check with more experienced people.

I have set up everything to crawl our intranet, with Nutch 0.7.

I create the initial index with something like:

bin/nutch crawl $MY_URL_FILE -dir $MY_CRAWL_DIR -depth X -topN Y

Then periodically (daily?) I maintain that index with either:

.- The "Maintenance Shell Script" from "Nutch - The Java Search Engine - 
Nutch Wiki"
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

or

.- The script from "IntranetRecrawl - Nutch Wiki"
http://wiki.apache.org/nutch/IntranetRecrawl

Both seem to be more or less equivalent. After running one of those, one 
would restart the web application.

Then, it is recommended to remove the whole $MY_CRAWL_DIR every now and 
then (months) and start all over. To do so one could create the new 
crawl dir under a different name and then stop the web application, 
remove and rename the crawl directories and start the web application.
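
The swap described above might be scripted roughly as follows. This is only 
a sketch under assumptions, not a tested procedure: the function name 
swap_crawl_dir and the STOP_CMD/START_CMD variables are made up for 
illustration, and how the web application is actually stopped and started 
depends on the servlet container in use.

```shell
#!/bin/sh
# Sketch: replace the live crawl directory with a freshly built one.
# swap_crawl_dir, STOP_CMD and START_CMD are hypothetical names; set the
# two variables to whatever stops/starts your web application.

swap_crawl_dir() {
    live_dir=$1    # directory the web application reads, e.g. $MY_CRAWL_DIR
    new_dir=$2     # fresh crawl that was built under a different name

    # Refuse to touch the live index if the new crawl is missing.
    if [ ! -d "$new_dir" ]; then
        echo "new crawl dir not found: $new_dir" >&2
        return 1
    fi

    ${STOP_CMD:-true}                  # stop the web application (no-op if unset)

    # Move the old crawl aside first, so it is only deleted once the
    # new one is in place.
    old_dir="$live_dir.old"
    rm -rf "$old_dir"
    if [ -d "$live_dir" ]; then
        mv "$live_dir" "$old_dir" || return 1
    fi
    mv "$new_dir" "$live_dir" || return 1
    rm -rf "$old_dir"

    ${START_CMD:-true}                 # start the web application (no-op if unset)
}
```

With STOP_CMD and START_CMD exported beforehand, a call would look like 
swap_crawl_dir "$MY_CRAWL_DIR" "$MY_CRAWL_DIR.new". Moving the old directory 
aside before deleting it keeps a copy around if the second mv fails.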

Would that be more or less correct? Any special preference for the 
maintenance script? I guess the recommended intervals for the cleaning 
and recrawling depend on the site, but any recommendation for a medium 
intranet?

In order to pick up the latest news, would you recommend configuring 
special recrawls for the "news section" of the web site and running them 
more frequently (and then making the whole recrawl less frequent)?

Any advice is welcome,
Thanks,
D.