Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/08/07 23:27:08 UTC

Nutch website deployment

Hi,

I just wanted to finally add myself to the list of Nutch committers on 
the Nutch website, and I am not sure how to deploy it.

So I have installed forrest and modified 
src/site/src/documentation/content/xdocs.
Then I ran 'forrest', and it generated content in src/site/build/site.

And now the questions:

Should I copy src/site/build/site to site and commit it?

How to deploy it to public Apache Web server?
Regards
Piotr

no crossposting, please!

Posted by Doug Cutting <cu...@nutch.org>.
Can folks please try to avoid cross-posting messages to both nutch-user 
and nutch-dev?  Many folks are on both lists and don't appreciate seeing 
multiple copies of messages.  If a question is about using Nutch, then 
it should be sent to nutch-user.  Discussions about modifying Nutch's 
implementation should be sent to nutch-dev.  When in doubt, try 
nutch-user first.

Doug

Re: [Nutch-dev] Field.Text vs Field.UnStored

Posted by praveen pathiyil <pa...@gmail.com>.
Hi,

You have four different options for field types:

Field method                       Tokenized   Indexed   Stored
Field.Keyword(String, String)      No          Yes       Yes
Field.UnIndexed(String, String)    No          No        Yes
Field.UnStored(String, String)     Yes         Yes       No
Field.Text(String, String)         Yes         Yes       Yes
 
Check out Otis' introductory article for a background on this:
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

Regards,
Praveen.
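[For reference, the flags in the table above can be captured in a small standalone Java sketch. The enum below only mirrors the Lucene 1.x factory-method names from the table; it does not depend on Lucene itself.]

```java
// A standalone sketch of the four Lucene 1.x Field factories and the
// tokenized/indexed/stored flags each one implies. Names mirror the
// table above; this does not depend on Lucene.
public enum FieldKind {
    KEYWORD  (false, true,  true),   // Field.Keyword(String, String)
    UNINDEXED(false, false, true),   // Field.UnIndexed(String, String)
    UNSTORED (true,  true,  false),  // Field.UnStored(String, String)
    TEXT     (true,  true,  true);   // Field.Text(String, String)

    public final boolean tokenized, indexed, stored;

    FieldKind(boolean tokenized, boolean indexed, boolean stored) {
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    public static void main(String[] args) {
        for (FieldKind k : values()) {
            System.out.printf("%-10s tokenized=%-5b indexed=%-5b stored=%b%n",
                    k, k.tokenized, k.indexed, k.stored);
        }
    }
}
```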



On 8/12/05, EM <em...@cpuedge.com> wrote:
> I need some help figuring out the following:
> 
> I was looking at: BasicIndexingFilter.java where it's stated:
> 
> // url is both stored and indexed, so it's both searchable and returned
> doc.add(Field.Text("url", url));
> 
> // content is indexed, so that it's searchable, but not stored in index
> doc.add(Field.UnStored("content", parse.getText()));
> 
> I'm stuck on what replacement can be made here. I'm assuming doc.add is the
> object that would add tokens to the index? How can a token (word, phrase) be
> "searchable but not stored in the index"?
> 
> I'm basically trying to do the following, given two pages A and B:
> A is written in an Eastern alphabet.
> B is written in the Latin alphabet.
> I would like to index page B as it is, and page A as it is, and the content
> of page A translated to Latin in addition to it.
> 
> Would I have to add something like:
> String content = parse.getText();
> content += " ";
> content += myTranslationFunctionToLatin(content);
> doc.add(Field.Text("content", content));
> 
> Or would the last line be:
> doc.add(Field.UnStored("content", content));
> 
> What's the difference with regard to the Field.* object?
> 
> 
> Regards,
> EM
> 
> 
> 

Re: [Nutch-dev] Re: regex-url filter

Posted by Hasan Diwan <ha...@gmail.com>.
On Aug 9, 2005, at 6:23 PM, Zhou LiBing wrote:

> 1) If I want to limit multi-domains,what should I do

Change:
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to:
+^http://([a-z0-9]*\.)*(domain1|domain2|...|domainN)/
> 2) If I have a mirror web site on local disk, how can I use Nutch
> to search
> the content of the mirror site?

Change the line: -^(file|ftp|mailto):
to read (both of the following lines should solve your problem):
-^(ftp|mailto):
+^(file):
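[The multi-domain pattern above can be sanity-checked outside Nutch with plain java.util.regex. The domains below are placeholders for your own list; note that dots inside the domain names should be escaped so they match literal dots only.]

```java
import java.util.regex.Pattern;

public class MultiDomainFilterCheck {
    // Placeholder domains; substitute your own list. The \\. inside the
    // alternation keeps the dot literal instead of matching any character.
    static final Pattern ACCEPT = Pattern.compile(
            "^http://([a-z0-9]*\\.)*(example\\.com|example\\.org)/");

    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.example.com/index.html")); // true
        System.out.println(accepts("http://sub.example.org/"));           // true
        System.out.println(accepts("http://www.other.net/"));             // false
    }
}
```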

Let me know if it works...
Cheers,
Hasan Diwan <ha...@gmail.com>


Re: [Nutch-dev] Re: regex-url filter

Posted by Zhou LiBing <zh...@gmail.com>.
1) If I want to limit multiple domains, what should I do? Thanks.
2) If I have a mirror web site on local disk, how can I use Nutch to search 
the content of the mirror site?
 

 2005/8/10, Hasan Diwan <ha...@gmail.com>: 
> 
> Jay:
> On Aug 8, 2005, at 12:24 PM, Jay Pound wrote:
> 
> > is there any way to filter results to english via search, so I can
> > setup a
> > multi-language search, I thought I saw somewhere that you could put
> something into the form of the html, a switch while submitting the
> > form that
> > would use a plugin to filter the results? I know I had seen some
> > benchmarks
> > on a plugin made to do this
> 
There's a languageidentifier plugin in src/plugin/languageidentifier
> that one could use to do this.
> 
> Cheers,
> Hasan Diwan <ha...@gmail.com>
> 
> 
> 
> 


-- 
---Letter From your friend Blue at HUST CGCL---

Re: Site Content not indexed ? Nutch 0.7

Posted by Andrzej Bialecki <ab...@getopt.org>.
Nils Hoeller wrote:
> Hi,
> 
> actually I thought the content of the pages
> is being indexed.
> 
> When I have a look with Luke at the 
> index of a Nutch Crawl, it says 
> contents not available. 

Please try the "Reconstruct & Edit" button, and you should see some text 
from the content. The plain text is NOT stored in the Lucene index; it is 
only indexed there - the text itself is stored in the segment's parse_text.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Site Content not indexed ? Nutch 0.7

Posted by Nils Hoeller <ni...@arcor.de>.
Hi,

actually I thought the content of the pages
is being indexed.

When I have a look with Luke at the 
index of a Nutch Crawl, it says 
contents not available. 

When I search for a word in field "content"
that IS IN A SITE in the index, 
it gives me no results. 

Now I saw something in the config files
saying that content is not yet being indexed!?

What's correct? Is it my fault? Do 
I have to enable some feature of the crawl 
to index the content?
Is the content field really not available? 


Thanks for your help.

Nils


Re: Field.Text vs Field.UnStored

Posted by Matthias Jaekle <ja...@eventax.de>.
> I'm assuming doc.add is the
> object that would add tokens to the index? 
Sometimes.

> How can a token (word, phrase) be
> "searchable but not stored in the index"?
Impossible.

You can only search stuff that is in the index, but you cannot reconstruct 
page content from your index.
If you want to be able to get parts of the original content back, you also 
have to store the page.

So: index the parts you would like to search; store the stuff you would 
like to get back out of your system in its original form. Or do both.

If you want to search special fields, you should not extend the content 
field; you should create a new field instead.

Maybe it is better to have a look at the index-more plugin instead of the 
basic indexing stuff.

Matthias
-- 
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events

Field.Text vs Field.UnStored

Posted by EM <em...@cpuedge.com>.
I need some help figuring out the following:

I was looking at: BasicIndexingFilter.java where it's stated:

// url is both stored and indexed, so it's both searchable and returned
doc.add(Field.Text("url", url));

// content is indexed, so that it's searchable, but not stored in index
doc.add(Field.UnStored("content", parse.getText()));

I'm stuck on what replacement can be made here. I'm assuming doc.add is the
object that would add tokens to the index? How can a token (word, phrase) be
"searchable but not stored in the index"?

I'm basically trying to do the following, given two pages A and B:
A is written in an Eastern alphabet.
B is written in the Latin alphabet.
I would like to index page B as it is, and page A as it is, and the content
of page A translated to Latin in addition to it.

Would I have to add something like:
String content = parse.getText();
content += " ";
content += myTranslationFunctionToLatin(content);
doc.add(Field.Text("content", content));

Or would the last line be:
doc.add(Field.UnStored("content", content));

What's the difference with regard to the Field.* object?


Regards,
EM


strange url counting in the fetcher

Posted by EM <em...@cpuedge.com>.
Should the following be happening?

Short description:
- fetch a bunch of pages
- status: 5400 fetched, 27 errors
- fetch 22 more pages
- status: 5403 fetched, 27 errors

My regex-urlfilter excludes "jpg" and includes "?"

Long description:

050809 093001 fetching http://<domain>/<text>.jpg?6351
050809 093001 fetching http://<domain>/<text>.jpg?4141
050809 093001 fetching http://<domain>/<text>.jpg?4333
050809 093001 fetching http://<domain>/<text>.htm
050809 093002 status: segment 20050808224036-5, 5400 pages, 27 errors,
40986108 bytes, 30626719 ms
050809 093002 status: 0.17631662 pages/s, 10.455052 kb/s, 7590.02 bytes/page
050809 093002 fetching http://<domain>/<text>.jpg?3732
050809 093002 fetching http://<domain>/<text>.jpg?1398
050809 093002 fetching http://<domain>/<text>.jpg?3876
050809 093002 fetching http://<domain>/<text>.jpg?2260
050809 093002 fetching http://<domain>/<text>.jpg?3298
050809 093002 fetching http://<domain>/<text>.jpg?9396
050809 093002 fetching http://<domain>/<text>.jpg?1946
050809 093002 fetching http://<domain>/<text>.jpg?9897
050809 093002 fetching http://<domain>/<text>.htm
050809 093007 Response content length is not known
050809 093007 fetching http:// <domain>/<text>.htm
050809 093014 Response content length is not known
050809 093015 fetching http://<domain>/<text>.jpg?8507
050809 093015 fetching http:// <domain>/<text>.htm
050809 093022 Response content length is not known
050809 093023 fetching http://<domain>/<text>.jpg?693
050809 093023 fetching http://<domain>/<text>.jpg?4637
050809 093023 fetching http://<domain>/<text>.jpg?7929
050809 093023 fetching http://<domain>/<text>.jpg?7113
050809 093023 fetching http://<domain>/<text>.jpg?6956
050809 093023 fetching http://<domain>/<text>.jpg?5054
050809 093023 fetching http://<domain>/<text>.jpg?4768
050809 093023 fetching http://<domain>/<text>.jpg?1673
050809 093023 fetching http://<domain>/<text>.jpg?2583
050809 093023 fetching http://<domain>/<text>.jpg?4526
050809 093024 status: segment 20050808224036-5, 5403 pages, 27 errors,
41002564 bytes, 30648859 ms
050809 093024 status: 0.17628714 pages/s, 10.451695 kb/s, 7588.8516
bytes/page


Re: [Nutch-dev] Re: regex-url filter

Posted by Hasan Diwan <ha...@gmail.com>.
Jay:
On Aug 8, 2005, at 12:24 PM, Jay Pound wrote:

> is there any way to filter results to english via search, so I can  
> setup a
> multi-language search, I thought I saw somewhere that you could put
> something into the form of the html, a switch while submitting the  
> form that
> would use a plugin to filter the results? I know I had seen some  
> benchmarks
> on a plugin made to do this

There's a languageidentifier plugin in src/plugin/languageidentifier  
that one could use to do this.

Cheers,
Hasan Diwan <ha...@gmail.com>


Re: regex-url filter

Posted by Jay Pound <we...@poundwebhosting.com>.
Is there any way to filter results to English via search, so I can set up a
multi-language search? I thought I saw somewhere that you could put
something into the form of the HTML, a switch while submitting the form that
would use a plugin to filter the results. I know I had seen some benchmarks
on a plugin made to do this.
-Jay Pound

----- Original Message ----- 
From: "Chirag Chaman" <de...@filangy.com>
To: <nu...@lucene.apache.org>; <nu...@lucene.apache.org>
Sent: Monday, August 08, 2005 3:02 PM
Subject: RE: regex-url filter


> Here's a better way
>
> http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
>
> FYI, this will not remove non-English sites -- but international sites that
> follow the two-letter convention.
>
> CC-
>
> -----Original Message-----
> From: Jay Pound [mailto:webmaster@poundwebhosting.com]
> Sent: Monday, August 08, 2005 2:37 PM
> To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
> Subject: regex-url filter
>
> I would like a confirmation from someone that this will work, I've edited
> the regex filter in hopes to weed out non-english sites from my search
> results, I'll be testing pruning on my current 40mil index to see if it
> works there, or maybe there is a way to set the search to return only
> english results, but I'm trying it this way now, is this the right way to
> add just extensions without sites?
> I'll try it soon but just wanted to not waste my time if it's not
> correct!!!
> Thanks,
> -Jay Pound
> # The default url filter.
>
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
>
> # prefixed by '+' or '-'. The first matching pattern in the file
>
> # determines whether a URL is included or ignored. If no pattern
>
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
>
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
>
> -[?*!@=]
>
> # accept US only sites
>
> +^http://([a-z0-9]*\.)*.com/
>
> +^http://([a-z0-9]*\.)*.org/
>
> +^http://([a-z0-9]*\.)*.edu/
>
> +^http://([a-z0-9]*\.)*.net/
>
> +^http://([a-z0-9]*\.)*.mil/
>
> +^http://([a-z0-9]*\.)*.us/
>
> +^http://([a-z0-9]*\.)*.info/
>
> +^http://([a-z0-9]*\.)*.cc/
>
> +^http://([a-z0-9]*\.)*.biz/
>
>
>
>
>



Re: regex-url filter

Posted by Rob Pettengill <ro...@earthlink.net>.
DNS names use a bit richer character set than you have specified.  In  
this case I suspect that by specifying the inverse you will get  
closer to what you intend:

+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*

A few days back there was a request for features that can be handled  
by the RegexURLFilter.  If you have any uncertainties about the regex  
syntax supported, it is easy to put some test rules in a customized  
conf directory and test them as follows:

$ #setup your shell environment as required.  e.g. for bash:
$ export NUTCH_CONF_DIR=/Users/rcp/project/nutch/test/conf
$ export NUTCH_HOME=/Users/rcp/project/nutch/nutch-rcp
$ #next run
$ $NUTCH_HOME/bin/nutch net.nutch.net.RegexURLFilter
# you can now type in urls, one per line and see the results
http://www.somedomain.com/
+http://www.somedomain.com/
http://www.somedomain.co.uk/
-http://www.somedomain.co.uk/

of course they don't speak "American English" in the UK :-)


;rob
--
Robert C. Pettengill, Ph.D.
    rcp@stanfordalumni.org

Questions about petroleum?
     Goto:   http://AskAboutOil.com/
Need help implementing search?
     Goto:   http://MesaVida.com/


On 2005, Aug 8, at 2:27 PM, Piotr Kosiorowski wrote:

> Hello,
> I am not sure which way is better but I would look for "dot":
> original>http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
> modified>http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/
> In my opinion the "dot" before com, org etc. is already included in
> ([a-z0-9]*\.)* and the additional one (not escaped) means any character, so
> it would match e.g.:
> http://www.abc.xcom/
> but not
> http://www.abc.com/.
> Regards,
> P.
>
>
> Chirag Chaman wrote:
>
>> Here's a better way
>> http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
>> FYI, this will not remove non-English sites -- but international  
>> sites that
>> follow the two-letter convention.
>> CC-
>>  -----Original Message-----
>> From: Jay Pound [mailto:webmaster@poundwebhosting.com] Sent:  
>> Monday, August 08, 2005 2:37 PM
>> To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
>> Subject: regex-url filter
>> I would like a confirmation from someone that this will work, I've  
>> edited
>> the regex filter in hopes to weed out non-english sites from my  
>> search
>> results, I'll be testing pruning on my current 40mil index to see  
>> if it
>> works there, or maybe there is a way to set the search to return only
>> english results, but I'm trying it this way now, is this the right  
>> way to
>> add just extensions without sites?
>> I'll try it soon but just wanted to not waste my time if it's not
>> correct!!!
>> Thanks,
>> -Jay Pound
>> # The default url filter.
>> # Better for whole-internet crawling.
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'. The first matching pattern in the file
>> # determines whether a URL is included or ignored. If no pattern
>> # matches, the URL is ignored.
>> # skip file: ftp: and mailto: urls
>> -^(file|ftp|mailto):
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>> # accept US only sites
>> +^http://([a-z0-9]*\.)*.com/
>> +^http://([a-z0-9]*\.)*.org/
>> +^http://([a-z0-9]*\.)*.edu/
>> +^http://([a-z0-9]*\.)*.net/
>> +^http://([a-z0-9]*\.)*.mil/
>> +^http://([a-z0-9]*\.)*.us/
>> +^http://([a-z0-9]*\.)*.info/
>> +^http://([a-z0-9]*\.)*.cc/
>> +^http://([a-z0-9]*\.)*.biz/
>>
>
>


Re: regex-url filter

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
I am not sure which way is better, but I would look for the "dot":
original>http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
modified>http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/
In my opinion the "dot" before com, org etc. is already included in 
([a-z0-9]*\.)*, and the additional one (not escaped) means any character, so 
it would match e.g.:
http://www.abc.xcom/
but not
http://www.abc.com/.
Regards,
P.
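[The claim above can be checked with plain java.util.regex. This is a standalone sketch, not the actual Nutch RegexURLFilter; the patterns are copied verbatim from the thread, including the stray unescaped dot in the original.]

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {
    // Pattern as posted, with the unescaped dot after the group:
    static final Pattern ORIGINAL = Pattern.compile(
        "http://([a-z0-9]*\\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/");
    // Piotr's modified pattern, with the stray dot removed:
    static final Pattern MODIFIED = Pattern.compile(
        "http://([a-z0-9]*\\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/");

    static boolean matches(Pattern p, String url) {
        return p.matcher(url).find();
    }

    public static void main(String[] args) {
        // The unescaped dot swallows one character, so after ([a-z0-9]*\.)*
        // has consumed "www.abc." the TLD can no longer match:
        System.out.println(matches(ORIGINAL, "http://www.abc.com/"));  // false
        System.out.println(matches(ORIGINAL, "http://www.abc.xcom/")); // true
        // With the stray dot removed, normal URLs match again:
        System.out.println(matches(MODIFIED, "http://www.abc.com/"));  // true
    }
}
```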


Chirag Chaman wrote:
> Here's a better way
> 
> http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
> 
> FYI, this will not remove non-English sites -- but international sites that
> follow the two-letter convention.
> 
> CC-
>  
> -----Original Message-----
> From: Jay Pound [mailto:webmaster@poundwebhosting.com] 
> Sent: Monday, August 08, 2005 2:37 PM
> To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
> Subject: regex-url filter
> 
> I would like a confirmation from someone that this will work, I've edited
> the regex filter in hopes to weed out non-english sites from my search
> results, I'll be testing pruning on my current 40mil index to see if it
> works there, or maybe there is a way to set the search to return only
> english results, but I'm trying it this way now, is this the right way to
> add just extensions without sites?
> I'll try it soon but just wanted to not waste my time if it's not correct!!!
> Thanks,
> -Jay Pound
> # The default url filter.
> 
> # Better for whole-internet crawling.
> 
> # Each non-comment, non-blank line contains a regular expression
> 
> # prefixed by '+' or '-'. The first matching pattern in the file
> 
> # determines whether a URL is included or ignored. If no pattern
> 
> # matches, the URL is ignored.
> 
> # skip file: ftp: and mailto: urls
> 
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> 
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> 
> -[?*!@=]
> 
> # accept US only sites
> 
> +^http://([a-z0-9]*\.)*.com/
> 
> +^http://([a-z0-9]*\.)*.org/
> 
> +^http://([a-z0-9]*\.)*.edu/
> 
> +^http://([a-z0-9]*\.)*.net/
> 
> +^http://([a-z0-9]*\.)*.mil/
> 
> +^http://([a-z0-9]*\.)*.us/
> 
> +^http://([a-z0-9]*\.)*.info/
> 
> +^http://([a-z0-9]*\.)*.cc/
> 
> +^http://([a-z0-9]*\.)*.biz/
> 
> 
> 
> 
> 


RE: regex-url filter

Posted by Chirag Chaman <de...@filangy.com>.
Here's a better way

http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/

FYI, this will not remove non-English sites -- but international sites that
follow the two-letter convention.

CC-
 
-----Original Message-----
From: Jay Pound [mailto:webmaster@poundwebhosting.com] 
Sent: Monday, August 08, 2005 2:37 PM
To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
Subject: regex-url filter

I would like a confirmation from someone that this will work, I've edited
the regex filter in hopes to weed out non-english sites from my search
results, I'll be testing pruning on my current 40mil index to see if it
works there, or maybe there is a way to set the search to return only
english results, but I'm trying it this way now, is this the right way to
add just extensions without sites?
I'll try it soon but just wanted to not waste my time if it's not correct!!!
Thanks,
-Jay Pound
# The default url filter.

# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression

# prefixed by '+' or '-'. The first matching pattern in the file

# determines whether a URL is included or ignored. If no pattern

# matches, the URL is ignored.

# skip file: ftp: and mailto: urls

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

# accept US only sites

+^http://([a-z0-9]*\.)*.com/

+^http://([a-z0-9]*\.)*.org/

+^http://([a-z0-9]*\.)*.edu/

+^http://([a-z0-9]*\.)*.net/

+^http://([a-z0-9]*\.)*.mil/

+^http://([a-z0-9]*\.)*.us/

+^http://([a-z0-9]*\.)*.info/

+^http://([a-z0-9]*\.)*.cc/

+^http://([a-z0-9]*\.)*.biz/






regex-url filter

Posted by Jay Pound <we...@poundwebhosting.com>.
I would like a confirmation from someone that this will work.
I've edited the regex filter in hopes of weeding out non-English sites from my
search results. I'll be testing pruning on my current 40mil index to see if
it works there, or maybe there is a way to set the search to return only
English results, but I'm trying it this way now. Is this the right way to
add just extensions without sites?
I'll try it soon but just wanted to not waste my time if it's not correct!!!
Thanks,
-Jay Pound
# The default url filter.

# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression

# prefixed by '+' or '-'. The first matching pattern in the file

# determines whether a URL is included or ignored. If no pattern

# matches, the URL is ignored.

# skip file: ftp: and mailto: urls

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

# accept US only sites

+^http://([a-z0-9]*\.)*.com/

+^http://([a-z0-9]*\.)*.org/

+^http://([a-z0-9]*\.)*.edu/

+^http://([a-z0-9]*\.)*.net/

+^http://([a-z0-9]*\.)*.mil/

+^http://([a-z0-9]*\.)*.us/

+^http://([a-z0-9]*\.)*.info/

+^http://([a-z0-9]*\.)*.cc/

+^http://([a-z0-9]*\.)*.biz/




Re: Nutch website deployment

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Thanks. I will add it to the Wiki (but not today).
P.
Doug Cutting wrote:
> Piotr Kosiorowski wrote:
> 
>> So I have installed forrest and modified 
>> src/site/src/documentation/content/xdocs.
>> Then I ran 'forrest', and it generated content in src/site/build/site.
>>
>> And now the questions:
>>
>> Should I copy src/site/build/site to site and commit it?
> 
> 
> Yes.  I'm impressed that you got this far on your own!  We need to 
> document this process...
> 
>> How to deploy it to public Apache Web server?
> 
> 
> ssh people.apache.org
> cd /www/lucene.apache.org/nutch
> svn up
> 
> Then wait a few hours.  The website is sync'd from people.apache.org.
> 
> Thanks,
> 
> Doug
> 


Re: Nutch website deployment

Posted by Doug Cutting <cu...@nutch.org>.
Piotr Kosiorowski wrote:
> So I have installed forrest and modified 
> src/site/src/documentation/content/xdocs.
> Then I ran 'forrest', and it generated content in src/site/build/site.
> 
> And now the questions:
> 
> Should I copy src/site/build/site to site and commit it?

Yes.  I'm impressed that you got this far on your own!  We need to 
document this process...

> How to deploy it to public Apache Web server?

ssh people.apache.org
cd /www/lucene.apache.org/nutch
svn up

Then wait a few hours.  The website is sync'd from people.apache.org.

Thanks,

Doug