Posted to dev@nutch.apache.org by EM <em...@cpuedge.com> on 2005/08/03 05:25:02 UTC

My wishlist of 12 out of...

I've been using Nutch for quite a while and reading this list constantly.

I'll state some assumptions in this post about the way Nutch operates; if
they are wrong, please excuse my ignorance. I use the interfaces
extensively, so I'm assuming things about what happens behind them.

I think I'm the average user:
	-up to 0.2-4 million pages across a set of several thousand websites
	-db + segments rarely going over 10 GB
	-fetching is often done on one computer and uploaded elsewhere.
	-not small enough a user to do everything by hand
	-not large enough to need a 100% automated, inflexible solution.

Over the past two months, I've found manual workarounds for most of Nutch's
limitations. Being able to do that is really nice; however, since I don't
have as much time lately to manually control the fetching process, I would
like to see the following features added, and I hope that others will
benefit from them:

1. More flexible regex-urlfilter. I want to be able to override the
regex-filter for some sites. For example, in the regex-urlfilter I would
put:
<overrides>
	<site>
		<url>http://Domain1.com</url>
		<file>f1</file>
	</site>
	<site>
		<url>http://Domain21.com</url>
		<file>f2</file>
	</site>
</overrides>

Then I would create two new regex-urlfilter files (f1 and f2), one for each
domain.
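
Something along these lines is what I have in mind, as a minimal sketch
only: the class, method, and field names below are mine, this is not the
real Nutch filter plugin API, and the rules are simplified to plain allow
patterns instead of the +/- syntax the real regex-urlfilter uses.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch only: a URL filter that picks a per-domain rule set when one is
// configured, and falls back to the global rules otherwise.
public class PerDomainUrlFilter {

    // domain prefix -> compiled rules loaded from that domain's override file
    private final Map<String, List<Pattern>> overrides = new HashMap<>();
    private final List<Pattern> globalRules = new ArrayList<>();

    public void addOverride(String domainPrefix, List<String> regexes) {
        List<Pattern> compiled = new ArrayList<>();
        for (String r : regexes) {
            compiled.add(Pattern.compile(r));
        }
        overrides.put(domainPrefix, compiled);
    }

    public void addGlobalRule(String regex) {
        globalRules.add(Pattern.compile(regex));
    }

    /** Returns true if the URL passes the rules that apply to it. */
    public boolean accept(String url) {
        for (Map.Entry<String, List<Pattern>> e : overrides.entrySet()) {
            if (url.startsWith(e.getKey())) {
                return matchesAny(url, e.getValue()); // override wins
            }
        }
        return matchesAny(url, globalRules);          // default rules
    }

    private static boolean matchesAny(String url, List<Pattern> rules) {
        for (Pattern p : rules) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}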

2. Pages per host. There should really be a built-in maximum number of pages
to fetch per host. I don't think limits on link depth would work here
because websites are too diverse. So, simply: this host = 2000 max.
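
For illustration, a rough sketch of the check I mean when the fetch list is
generated; the class and method names, and the idea of capping at generate
time, are my assumptions, not existing Nutch code.

import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: drop URLs for a host once that host already has maxPerHost entries
// in the fetch list.
public class PerHostCap {

    public static List<String> capPerHost(List<String> candidates, int maxPerHost) {
        Map<String, Integer> perHost = new HashMap<>();
        List<String> fetchList = new ArrayList<>();
        for (String u : candidates) {
            String host;
            try {
                host = new URL(u).getHost();
            } catch (Exception e) {
                continue; // skip malformed URLs
            }
            int seen = perHost.getOrDefault(host, 0);
            if (seen < maxPerHost) {
                perHost.put(host, seen + 1);
                fetchList.add(u);
            }
        }
        return fetchList;
    }
}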

3. A threads-per-host feature that really works; the current one is buggy.
Example:
I was fetching half a million pages, and it went something like this:
1st run: fetching 200 pages, everything's fine, fewer than 10 errors
2nd run: fetching 1000 pages, everything's fine, fewer than 10 errors
3rd run: fetching 20000 pages, everything's fine, fewer than 100 errors
4th run: fetching 200000 pages, everything's fine, fewer than 100 errors
5th run: fetching 200000 pages, everything's fine, fewer than 100 errors
6th run: fetching 10000 pages, not quite everything is fine, 1000-3000 errors
7th run: fetching 5000 pages, 4000 errors

No matter how many times one repeats the fetching process, at the end they
are bound to end up with fetch lists dominated by hosts with MANY pages, and
there will be collisions and many errors. So, whenever the fetch list ends
up with 4000 pages from one single host, could someone implement a function
that decreases the number of threads to whatever threads-per-host is set to?
I know that increasing the per-host delay solves this, but that's just a
workaround and it's not obvious at first sight (well, at least not to many
people, since I see the question popping up all the time).
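
Roughly, this is what I would like the fetcher to do before it starts its
threads. It is only a sketch of the idea; the method names and the "more
than half the list" heuristic are made up by me.

import java.util.Map;

// Sketch: if a single host dominates the fetch list, there is no point in
// running more threads than threads-per-host allows against it, so scale
// the total thread count down.
public class ThreadScaler {

    public static int effectiveThreads(Map<String, Integer> pagesPerHost,
                                       int totalPages,
                                       int configuredThreads,
                                       int threadsPerHost) {
        int biggestHost = 0;
        for (int count : pagesPerHost.values()) {
            biggestHost = Math.max(biggestHost, count);
        }
        // If, say, 4000 of 5000 pages belong to one host, most threads would
        // just sit waiting on the per-host delay anyway.
        if (biggestHost > totalPages / 2) {
            return Math.min(configuredThreads, threadsPerHost);
        }
        return configuredThreads;
    }
}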

4. Java exceptions generated during the fetching process should be kept to a
minimum.
I occasionally watch the fetching process for several minutes. Sometimes
it's hard to follow what's going on. What I'd like to see is the following:

Let's say there are six pages to be fetched: the first and the fifth have
content errors (.bmp or something), the third one throws a Java exception,
and the sixth one has a very long URL.

The output on screen should look something like this:

FAILED   #n06 http://www.mysite1.com/somdfglsdfgsdfedir/somepage1.bmp
FETCHED       http://www.mysite1.com/somedir/somsfdgsdgfepage1.htm
FAILED   #j01 http://www.mysdsgfsdfgite1.com/somesdgfsdgfdir/somepage2.bmp 
FETCHED       http://www.myssfdgite1.com/somedir/somepage2.htm
FAILED   #n06 http://www.mysisdfte1.com/somedir/somepage3.bmp 
FETCHED       http://www.mye1.com/somedir/somethi..  ..anddf/safd.htm

FAILED should be in red, FETCHED in green.
#n06 (or something) would be the code for unknown content.
#j01 would be the code for a Java exception, and a separate file would
collect all exceptions; it's hard to search for all exceptions manually.
Please keep each output to a single line.
Use \t instead of spaces; it's much more readable when things are aligned
properly.

The full output of what's happening should go in a separate file.

I believe this would help me and others to follow what's happening when 30+
threads are printing on the same screen. 
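
A tiny sketch of the kind of line I mean, with made-up status codes, tab
separation, and ANSI colours; none of this is existing Nutch output.

// Sketch: one status line per URL, tab-separated, with a short error code.
public class FetchStatusLine {

    private static final String RED = "\u001b[31m";
    private static final String GREEN = "\u001b[32m";
    private static final String RESET = "\u001b[0m";

    public static String format(boolean failed, String code, String url) {
        if (failed) {
            return RED + "FAILED" + RESET + "\t" + code + "\t" + url;
        }
        return GREEN + "FETCHED" + RESET + "\t\t" + url;
    }

    public static void main(String[] args) {
        System.out.println(format(true, "#n06", "http://www.mysite1.com/somepage1.bmp"));
        System.out.println(format(false, "", "http://www.mysite1.com/somepage1.htm"));
    }
}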

5. Real-time fetching control

During the fetch I want to be able to Interact! with it (a rough sketch of
the keyboard handling follows the list):
- Pressing P would pause the fetch
- Pressing R would resume the fetch
- Pressing S would print the status line (you don't know how boring it is to
WAIT for it to appear sometimes), and preferably pause the fetch for 3
seconds so I can write things down if I wish.
- Pressing !! followed by X would save the fetching process and exit nicely,
so whatever is in the script can proceed without me pressing ctrl-c a bunch
of times and then repairing/slicing segments and resuming things by hand.
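
Here is a rough idea of how the keyboard side could work, as a sketch only:
in a plain console you would still have to press Enter after the key unless
the terminal is put in raw mode, and the pause/stop flags would have to be
checked by the fetcher threads themselves, which I am not showing.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch: a background thread that reads single-letter commands from stdin
// and flips shared flags that the fetcher threads would check.
public class FetchConsole implements Runnable {

    private volatile boolean paused = false;
    private volatile boolean stopRequested = false;

    public boolean isPaused() { return paused; }
    public boolean isStopRequested() { return stopRequested; }

    @Override
    public void run() {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equalsIgnoreCase("p")) {
                    paused = true;
                    System.out.println("fetch paused");
                } else if (line.equalsIgnoreCase("r")) {
                    paused = false;
                    System.out.println("fetch resumed");
                } else if (line.equalsIgnoreCase("s")) {
                    System.out.println("status: ...");   // print the status line here
                } else if (line.equalsIgnoreCase("x")) {
                    stopRequested = true;                // checkpoint and exit cleanly
                    return;
                }
            }
        } catch (IOException e) {
            // stdin closed; nothing more to do
        }
    }
}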

6. Configurable recursion detection.
- During the fetch, Nutch parses the fetched page anyway, so calculate an
MD5 of the content and compare it to the MD5s of everything already fetched
from that host. If there is a match, dump the page right there and then.
Blacklist the URL and everything, or even better, since the URL is already
fetched, add a pointer to the existing fetched page. Solve deduplication
within a single host at runtime (see the sketch after this item).
- Provide the user with an option (enable/disable) to drop same-name
subdirectories and same-name second-level subdirectories, for example:
http://qwer.com/p1/p1
http://qwer.com/p1/p2/p1
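
The bookkeeping for the duplicate check itself is simple; here is a sketch
of what I mean, where the class is mine and only MessageDigest comes from
the standard library.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: remember the MD5 of every page fetched from a host during this
// run; if a new page hashes to something already seen, treat it as a
// duplicate.
public class PerHostDeduper {

    private final Map<String, Set<String>> seenByHost = new HashMap<>();

    /** Returns true if this content duplicates something already fetched from the host. */
    public boolean isDuplicate(String host, byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        Set<String> seen = seenByHost.computeIfAbsent(host, h -> new HashSet<>());
        return !seen.add(hex.toString()); // add() returns false if already present
    }
}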

7. Priority fetching. Some sites change a lot, some don't; why should
everything be refetched at the same time? Why not add a separate per-page
column with a float number named "changes"? Since the db is designed to last
forever, track the number of times each page has changed across all fetches.
Then, for the next fetch, read a threshold from the config file (it should
be user-configurable) and decide whether the page will be refetched. Add an
option to the nutch startup file to force an override on this for certain
pages (or fold them into improvement 1).
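
The refetch decision I have in mind would be something like this sketch,
where "changes" counts how often the page differed between past fetches and
the threshold comes from the config; all the names here are invented.

// Sketch: refetch a page only if it has historically changed often enough.
public class RefetchPolicy {

    /**
     * changedCount - how many past fetches found the page changed
     * fetchedCount - how many times the page has been fetched so far
     * threshold    - user-configurable, e.g. 0.2 = refetch pages that
     *                changed in at least 20% of past fetches
     */
    public static boolean shouldRefetch(int changedCount, int fetchedCount, float threshold) {
        if (fetchedCount < 2) {
            return true;            // not enough history yet, refetch anyway
        }
        float changeRatio = (float) changedCount / fetchedCount;
        return changeRatio >= threshold;
    }
}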

8. Make it easy to generate pages that will be used for vanilla searching
(no cached copy, no other extras) as opposed to full-sized searching (with
cached copy and all). Sometimes I want to put a couple of million or more
pages on the server and there isn't enough space.

9. Fetch only approved pages that Nutch can process. I know the
regex-urlfilter can do this with a bit of writing, but wouldn't it be nice
if one could just say:
<fetch-only>pdf|ht*|txt|asp|php</fetch-only>
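
Internally that could just be turned into one allow rule; a sketch follows
(I spelled the ht* entry as html? here, and the class is purely
illustrative).

import java.util.regex.Pattern;

// Sketch: turn a <fetch-only> list of extensions into a single URL suffix check.
public class FetchOnlyFilter {

    private final Pattern allowed;

    public FetchOnlyFilter(String extensions) {
        // e.g. extensions = "pdf|html?|txt|asp|php"
        this.allowed = Pattern.compile("\\.(" + extensions + ")$", Pattern.CASE_INSENSITIVE);
    }

    public boolean accept(String url) {
        return allowed.matcher(url).find();
    }

    public static void main(String[] args) {
        FetchOnlyFilter f = new FetchOnlyFilter("pdf|html?|txt|asp|php");
        System.out.println(f.accept("http://example.com/a.pdf"));  // true
        System.out.println(f.accept("http://example.com/a.exe"));  // false
    }
}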

10. What would be extremely nice is an interactive GUI to change fetch
parameters while the fetch is running.
For example:
-generating too many errors because of timeouts? No problem, decrease the
total threads, increase the delay-per-host, and so on.
-fetching some file from a host nobody has heard of, and it fails because it
cannot be parsed? Add it to the blacklist and don't bother fetching it
anymore for <the given host> or <all hosts>.

11. Reparsing and pruning.
Let's say I have Y GB on my hosting account. I've crawled things at home and
ended up with Y.1 GB. I have to dump something. Unfortunately, the way
things are set up right now, I'll have to modify the regex-urlfilter, export
the database, delete it, reimport everything with the regex changes, and
refetch. Or slice the segments and wipe random pages out of them. It would
be really nice to be able to simply copy the segments aside WITHOUT a
certain set of pages (or without the pages excluded by recent changes to the
regex-urlfilter).

12. A small fetch-only client, so I can send it together with the fetch list
to my friend(s) and they can fetch everything for me (because they might
have a WAY better connection to pages in a certain TLD than I do) and just
send the segments/content back here for me to index, etc. This small client
should be able to read encrypted config files and save its output with
encryption/hashing so that tampering would be hard to do.
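
For the tampering part, even just shipping a digest alongside each file
would go a long way. A sketch using the standard MessageDigest; the
encrypted config side is a separate problem and not shown.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute a SHA-256 digest of a segment file so the receiving side
// can verify it wasn't modified in transit.
public class SegmentDigest {

    public static String sha256(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                digest.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha256(args[0]));
    }
}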


If you have read up to this point, thanks; I know it's a long message and
not everyone will bother.

If you are a developer, thanks; I personally know how much time goes into
coding even simple things, and I hope you'll keep up the good work and maybe
implement some of my suggestions.

This list of 12 is just a small part of the things I would really like to
see included. I wish I could participate in the coding process and implement
these 12 and many, many more. Unfortunately, I'm a student, and by an ironic
cliche, my time goes mainly to school and working for food. A bit of time is
left, however, for playing with things I like, and I'm glad Nutch is one of
them.

P.S. I don't know how appropriate it is to ask, but is anyone offering a
paid position for Nutch development?

Keep up the good work,
EM in Toronto.