Posted to user@nutch.apache.org by Phillip Rhodes <sp...@rhoderunner.com> on 2006/12/21 21:44:43 UTC
convert bin/nutch to use ant?
I move between XP/Mac/Sun/Linux depending on the client or where I am (work
vs. home) and have found ant to be a good cross-platform scripting language. I
went to run a nutch crawl on my XP box, and the script is not set up to
run in an XP environment (yes, I could install cygwin).
I have started creating an ant file that I can use to invoke the
different java programs that come with nutch, and found a "job" file.
I have never come across a job file before. I am guessing it is a monster
jar file with all the dependencies inside it. I don't think that will work
in ant!
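One building block for invoking the java programs without a shell script is assembling the classpath in a portable way. The following is only a minimal sketch in Java: the class name NutchClasspath and the directory layout (a conf directory plus jars under lib, both beneath the install directory) are assumptions made for illustration, not something confirmed by this thread.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: builds a classpath string that works on any OS.
class NutchClasspath {

    // Assumed layout: <home>/conf for configuration, <home>/lib/*.jar for libraries.
    static String build(Path home) throws IOException {
        List<String> entries = new ArrayList<>();
        entries.add(home.resolve("conf").toString());

        Path lib = home.resolve("lib");
        if (Files.isDirectory(lib)) {
            try (Stream<Path> jars = Files.list(lib)) {
                entries.addAll(jars
                        .filter(p -> p.getFileName().toString().endsWith(".jar"))
                        .sorted() // stable order, independent of the file system
                        .map(Path::toString)
                        .collect(Collectors.toList()));
            }
        }
        // File.pathSeparator is ';' on Windows and ':' on Unix-like systems.
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) throws IOException {
        Path home = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(build(home));
    }
}
```

The resulting string could be handed to `java -cp` or to the classpath of an ant `<java>` task; which main class to run is left open here.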
One little trick that I have learned is to put my ant build.xml file
into my war file WEB-INF directory. Since a war file represents a
bundle of your dependencies, your ant file can easily be shipped with
your war file and provide a nice and easy way to invoke your java main
programs.
Has anyone created ant files to invoke the various nutch programs? Can
I help out doing this? How would folks feel about a 2nd war file (an
admin war app)? It would be a skeleton war file at first (providing no
functionality), but we would put all the plugins inside the war file
so that the ant file in the war can find them and we can run them. Down
the road, we could add a web UI to do a crawl, etc.
Thanks.
Re: NutchBean searching options
Posted by Dennis Kubes <nu...@dragonflymc.com>.
NutchBean creates a query through the [Query query =
Query.parse(args[0], conf);] call in its main method. The actual query
object is created behind the scenes by the whole nutch analysis
mechanism. This does a lot of work that is helpful in creating general
queries, but it is not the only way.
You can create your own query objects and pass them into the
NutchBean.search(query, int) method. Take a look at the Lucene in
Action book by Erik Hatcher and Otis Gospodnetic and this will show you
how to create different types of query objects such as wildcard queries.
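For the "special-characters insensitive" part of the question, one possible approach is to fold accents out of each term before it goes into a hand-built query. This is only a minimal sketch using the standard java.text.Normalizer class. The class name TermFolding is hypothetical, and it assumes the same folding is also applied to the text at indexing time (otherwise the folded terms would not match the index). Wiring it into the nutch analysis mechanism is not shown.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: makes a search term case- and accent-insensitive.
class TermFolding {

    static String fold(String term) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // \p{M} matches the combining marks; removing them leaves the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(fold(arg));
        }
    }
}
```

Note that this folds every diacritic, so letters that a language treats as distinct (for example ñ) are also reduced to their base letter.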
Dennis
Daniel López wrote:
> Hi,
>
> I have seen that NutchBean searches are case-insensitive and use a
> logical AND if several terms are given as criteria. Moreover, it uses
> full words, so no partial matches are allowed (no * or ? wildcards), and
> special characters (áéí...) have to be matched exactly.
>
> Is there any way one can tweak those settings and search through the
> NutchBean with other settings? I'm more interested in allowing partial
> matches and making criteria "special-characters insensitive".
>
> If NutchBean does not support that, is there any workaround, or should
> I go straight to the code and use low-level Lucene to
> accomplish it?
>
> Thanks,
> D.
NutchBean searching options
Posted by Daniel López <D....@uib.es>.
Hi,
I have seen that NutchBean searches are case-insensitive and use a
logical AND if several terms are given as criteria. Moreover, it uses
full words, so no partial matches are allowed (no * or ? wildcards), and
special characters (áéí...) have to be matched exactly.
Is there any way one can tweak those settings and search through the
NutchBean with other settings? I'm more interested in allowing partial
matches and making criteria "special-characters insensitive".
If NutchBean does not support that, is there any workaround, or should I
go straight to the code and use low-level Lucene to accomplish it?
Thanks,
D.
Intranet crawling maintenance
Posted by Daniel López <D....@uib.es>.
Hi there,
I think I have it more or less thought out, but just in case I missed
something, I would like to check with more experienced people.
I have set up everything to crawl our intranet, with Nutch 0.7.
I create the initial index with something like:
bin/nutch crawl $MY_URL_FILE -dir $MY_CRAWL_DIR -depth X -topN Y
Then periodically (daily?), I maintain that index with either:
.- The "Maintenance Shell Script" from "Nutch - The Java Search Engine -
Nutch Wiki"
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
or
.- The script from "IntranetRecrawl - Nutch Wiki"
http://wiki.apache.org/nutch/IntranetRecrawl
Both seem to be more or less equivalent. After running one of those, one
would restart the web application.
Then, it is recommended to remove the whole $MY_CRAWL_DIR every now and
then (months) and start all over. To do so one could create the new
crawl dir under a different name and then stop the web application,
remove and rename the crawl directories and start the web application.
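The remove-and-rename step described above can be sketched as follows. This is only an illustration under stated assumptions: the class name CrawlDirSwap and the ".old" suffix are hypothetical, both directories are assumed to sit on the same file system (so the moves are plain renames), and stopping and restarting the web application is left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: replaces the live crawl directory with a freshly built one.
class CrawlDirSwap {

    // Call this while the web application is stopped.
    static void swap(Path live, Path fresh) throws IOException {
        Path retired = live.resolveSibling(live.getFileName() + ".old");
        deleteRecursively(retired);        // clear leftovers from an earlier swap
        if (Files.exists(live)) {
            Files.move(live, retired);     // rename; assumes the same file system
        }
        Files.move(fresh, live);           // the new crawl takes the live name
        deleteRecursively(retired);        // finally remove the old crawl data
    }

    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CrawlDirSwap <live-crawl-dir> <new-crawl-dir>");
            return;
        }
        swap(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Renaming the old directory first, rather than deleting it up front, keeps the window in which no crawl directory exists under the live name as short as possible.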
Would that be more or less correct? Any special preference for the
maintenance script? I guess the recommended intervals for the cleaning
and recrawling depend on the site, but any recommendation for a medium
intranet?
In order to pick up the latest news, would you recommend configuring
special recrawls for the "news section" of the web site and running them
more frequently (and then making the whole recrawl less frequent)?
Any advice is welcome,
Thanks,
D.