Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2014/05/05 17:34:13 UTC

Problem with regex url filter

Hi Guys

I am trying to crawl a local intranet (currently a test) over http (with
the binary version of nutch 1.8).  The site is accessible at
http://localhost/ and my documents are in a directory called pcedocs, ie
are accessible at http://localhost/pcedocs/.

The documents currently in the folder are as follows:

-rw-r--r--. 1 root root    884 May  4 16:41 index1.html
-rw-r--r--. 1 root root    882 May  4 16:00 index.html
-rw-r--r--. 1 root root   2072 May  4 16:01 light_button.png
-rw-r--r--. 1 root root  35431 May  4 16:01 light_logo.png
-rw-r--r--. 1 root root    103 May  4 16:01 poweredby.png

so I'm expecting the two html docs to be picked up by the crawl.

My urls/seed.txt is as follows:

http://localhost/pcedocs/

My regex-urlfilter.txt is unchanged from the original except for the
following lines:

# accept anything else
+.*/pcedocs/.*

I have also tried replacing "+.*/pcedocs/.*" with:

+^http://([a-z0-9]*\.)*localhost/
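
(As an aside, a quick way to sanity-check either pattern, assuming the
stock Nutch 1.8 tooling, is the URLFilterChecker tool, which reads URLs
from stdin and echoes each one back with a leading + or - according to
whether the configured filters accept it:

echo "http://localhost/pcedocs/index1.html" | \
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

A leading + means the URL passes all active filters.)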

In both instances the crawl gives the following output:

Injector: starting at 2014-04-20 16:55:32
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: finished at 2014-04-20 16:55:35, elapsed: 00:00:02
Sun 20 Apr 16:55:35 EST 2014 : Iteration 1 of 2
Generating a new segment
Generator: starting at 2014-04-20 16:55:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140420165538
Generator: finished at 2014-04-20 16:55:40, elapsed: 00:00:03
Operating on segment : 20140420165538
Fetching : 20140420165538
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2014-04-20 16:55:40
Fetcher: segment: crawl/segments/20140420165538
Fetcher Timelimit set for : 1398041740834
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://localhost/pcedocs/ (queue crawl delay=5000ms)
.
.
.

Checking the Solr index shows that only one document has been indexed
(index.html) and that its url is http://localhost/pcedocs/.

What I'm expecting is for the crawl to produce two valid urls:

http://localhost/pcedocs/index.html
http://localhost/pcedocs/index1.html

(and as more documents are added more urls).

My question is how do I get nutch to crawl all the files on a web site,
not just the "root" url?

Is it a problem with my urlfilter or some other config?

I'm missing something basic here but can't for the life of me figure out
what.

Any help would be much appreciated.

Cheers

P

Re: Problem with regex url filter

Posted by Bayu Widyasanyata <bw...@gmail.com>.
I think it will be hard to disallow the directory listing while at the
same time allowing the files inside it to be fetched, crawled and
indexed, even though the nutch solrindex command can filter out the
directory listing (or any url regex) with its -filter option [0].

My advice is to create an index page (e.g. index.html) in that directory,
manually or automatically (generated by a script), which lists only
particular file types (.pdf, .docx, etc.). Nutch will then find those
files inside the directory via that index page.

[0] http://wiki.apache.org/nutch/bin/nutch%20solrindex
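
A minimal sketch of such a generator script (Python here; the document
root path and the extension list are assumptions to adjust for your
setup):

#!/usr/bin/env python
# Hedged sketch: write an index.html that links to selected file types
# in a directory, so nutch can discover them as outlinks.
import os

DOCROOT = "/var/www/html/pcedocs"   # assumed document root
EXTS = (".html", ".pdf", ".docx")   # file types to expose

names = sorted(n for n in os.listdir(DOCROOT)
               if n.endswith(EXTS) and n != "index.html")
with open(os.path.join(DOCROOT, "index.html"), "w") as out:
    out.write("<html><body>\n")
    for n in names:
        out.write('<a href="%s">%s</a><br/>\n' % (n, n))
    out.write("</body></html>\n")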


On Mon, May 19, 2014 at 10:26 PM, Paul Rogers <pa...@gmail.com> wrote:

> Hi Bayu
>
> Many thanks for that.
>
> What I'm trying to do is crawl the documents in the directory but not have
> nutch submit the directory listing to solr for indexing.  So, for example,
> if I have a directory with four pdf documents in it, nutch crawls it and
> solr indexes five documents (the directory listing and the four pdf
> documents).
>
> I can see the logic - nutch is crawling URLs, so
> http://mysite/my-directory/ (the directory listing) and
> http://mysite/my-directory/pdfdoc.pdf are both valid URLs.
>
> What I think I need is a regex filter that excludes directories (and their
> listings) but includes any files in them.
>
> Thanks
>
> P
>
>
> On 19 May 2014 09:31, Bayu Widyasanyata <bw...@gmail.com> wrote:
>
> > Hi Paul,
> >
> > Apologies for the late reply; I had other tasks that needed to be finished.
> >
> > The common practice, if your website is a typical information site
> > (e.g. a blog, product info, a company profile, etc.), is to *enable*
> > DirectoryIndex as described here [0]
> >
> > But if you have a particular directory which will be shown as a
> > directory listing and you don't want it crawled and indexed, you can
> > disallow it by configuring the nutch regex-urlfilter.txt file,
> > e.g.:
> >
> > -^http://yoursite.com/directory/directory-disallow/.*
> >
> > Thanks.-
> >
> >
> > On Fri, May 9, 2014 at 1:06 AM, Paul Rogers <pa...@gmail.com>
> > wrote:
> >
> > > Hi Bayu
> > >
> > > Many thanks for that.  Disabling the directory index page and
> > > enabling a directory listing has fixed the issue.  I now get three
> > > documents indexed: the directory listing, index.html and index1.html.
> > >
> > > Is there any way to stop nutch from indexing (rather than crawling)
> > > the directory listing itself?
> > >
> > > Thanks
> > >
> > > Paul
> > >
> > >
> > > On 5 May 2014 18:57, Bayu Widyasanyata <bw...@gmail.com>
> wrote:
> > >
> > > > On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <pa...@gmail.com>
> > > > wrote:
> > > >
> > > > > By that do you mean using file:// as opposed to http:// crawling?
> > > >
> > > >
> > > > Yupe.
> > > >
> > > >
> > >
> >
> https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
> > > >
> > > >
> > > > --
> > > > wassalam,
> > > > [bayu]
> > > >
> > >
> >
> >
> >
> > --
> > wassalam,
> > [bayu]
> >
>



-- 
wassalam,
[bayu]

Re: Problem with regex url filter

Posted by Paul Rogers <pa...@gmail.com>.
Hi Bayu

Many thanks for that.

What I'm trying to do is crawl the documents in the directory but not have
nutch submit the directory listing to solr for indexing.  So, for example,
if I have a directory with four pdf documents in it, nutch crawls it and
solr indexes five documents (the directory listing and the four pdf
documents).

I can see the logic - nutch is crawling URLs, so
http://mysite/my-directory/ (the directory listing) and
http://mysite/my-directory/pdfdoc.pdf are both valid URLs.

What I think I need is a regex filter that excludes directories (and their
listings) but includes any files in them.
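
One way to get that effect, sketched under the assumption that your
bin/nutch honours the NUTCH_CONF_DIR environment variable (the stock 1.8
script should) and that the Solr URL below matches your setup: a
crawl-time "-" rule would also stop nutch from fetching the listing, and
so from discovering the files, so the exclusion has to apply only at
indexing time, via a second conf directory and the -filter flag of the
solrindex command:

# conf-index/ is a copy of conf/ whose regex-urlfilter.txt adds, above
# the accept rules, a line that drops bare directory URLs:
-/$

# index with that conf; -filter applies the URL filters at index time
NUTCH_CONF_DIR=conf-index bin/nutch solrindex \
  http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb \
  crawl/segments/* -filter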

Thanks

P


On 19 May 2014 09:31, Bayu Widyasanyata <bw...@gmail.com> wrote:

> Hi Paul,
>
> Apologies for the late reply; I had other tasks that needed to be finished.
>
> The common practice, if your website is a typical information site
> (e.g. a blog, product info, a company profile, etc.), is to *enable*
> DirectoryIndex as described here [0]
>
> But if you have a particular directory which will be shown as a directory
> listing and you don't want it crawled and indexed, you can disallow it by
> configuring the nutch regex-urlfilter.txt file,
> e.g.:
>
> -^http://yoursite.com/directory/directory-disallow/.*
>
> Thanks.-
>
>
> On Fri, May 9, 2014 at 1:06 AM, Paul Rogers <pa...@gmail.com>
> wrote:
>
> > Hi Bayu
> >
> > Many thanks for that.  Disabling the directory index page and enabling
> > a directory listing has fixed the issue.  I now get three documents
> > indexed: the directory listing, index.html and index1.html.
> >
> > Is there any way to stop nutch from indexing (rather than crawling) the
> > directory listing itself?
> >
> > Thanks
> >
> > Paul
> >
> >
> > On 5 May 2014 18:57, Bayu Widyasanyata <bw...@gmail.com> wrote:
> >
> > > On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <pa...@gmail.com>
> > > wrote:
> > >
> > > > By that do you mean using file:// as opposed to http:// crawling?
> > >
> > >
> > > Yupe.
> > >
> > >
> >
> https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
> > >
> > >
> > > --
> > > wassalam,
> > > [bayu]
> > >
> >
>
>
>
> --
> wassalam,
> [bayu]
>

Re: Problem with regex url filter

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Paul,

Apologies for the late reply; I had other tasks that needed to be finished.

The common practice, if your website is a typical information site (e.g. a
blog, product info, a company profile, etc.), is to *enable* DirectoryIndex
as described here [0]

But if you have a particular directory which will be shown as a directory
listing and you don't want it crawled and indexed, you can disallow it by
configuring the nutch regex-urlfilter.txt file, e.g.:

-^http://yoursite.com/directory/directory-disallow/.*

Thanks.-


On Fri, May 9, 2014 at 1:06 AM, Paul Rogers <pa...@gmail.com> wrote:

> Hi Bayu
>
> Many thanks for that.  Disabling the directory index page and enabling a
> directory listing has fixed the issue.  I now get three documents indexed:
> the directory listing, index.html and index1.html.
>
> Is there any way to stop nutch from indexing (rather than crawling) the
> directory listing itself?
>
> Thanks
>
> Paul
>
>
> On 5 May 2014 18:57, Bayu Widyasanyata <bw...@gmail.com> wrote:
>
> > On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <pa...@gmail.com>
> > wrote:
> >
> > > By that do you mean using file:// as opposed to http:// crawling?
> >
> >
> > Yupe.
> >
> >
> https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
> >
> >
> > --
> > wassalam,
> > [bayu]
> >
>



-- 
wassalam,
[bayu]

Re: Problem with regex url filter

Posted by Paul Rogers <pa...@gmail.com>.
Hi Bayu

Many thanks for that.  Disabling the directory index page and enabling a
directory listing has fixed the issue.  I now get three documents indexed:
the directory listing, index.html and index1.html.

Is there any way to stop nutch from indexing (rather than crawling) the
directory listing itself?

Thanks

Paul


On 5 May 2014 18:57, Bayu Widyasanyata <bw...@gmail.com> wrote:

> On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <pa...@gmail.com>
> wrote:
>
> > By that do you mean using file:// as opposed to http:// crawling?
>
>
> Yupe.
>
> https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
>
>
> --
> wassalam,
> [bayu]
>

Re: Problem with regex url filter

Posted by Bayu Widyasanyata <bw...@gmail.com>.
On Tue, May 6, 2014 at 6:05 AM, Paul Rogers <pa...@gmail.com> wrote:

> By that do you mean using file:// as opposed to http:// crawling?


Yupe.
https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
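
A sketch of what file: crawling needs, assuming stock 1.8 defaults (the
local path is an assumption): the default regex-urlfilter.txt rejects
file: URLs, and the protocol-file plugin is not enabled out of the box.

# urls/seed.txt: point at the directory itself
file:/var/www/html/pcedocs/

# conf/regex-urlfilter.txt: relax the default
#   -^(file|ftp|mailto):
# so that file: URLs survive, e.g.
-^(ftp|mailto):

# conf/nutch-site.xml: add protocol-file to the plugin.includes value
# copied from nutch-default.xml (replacing protocol-http for a purely
# local crawl)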


-- 
wassalam,
[bayu]

Re: Problem with regex url filter

Posted by Paul Rogers <pa...@gmail.com>.
Hi Bayu

Many thanks for the response.

> Otherwise you can still fetch it through "directory crawling" (instead of
> browser crawling)

By that do you mean using file:// as opposed to http:// crawling?

Thanks

P


On 5 May 2014 17:42, Bayu Widyasanyata <bw...@gmail.com> wrote:

> On Mon, May 5, 2014 at 10:34 PM, Paul Rogers <pa...@gmail.com>
> wrote:
>
> > My question is how do I get nutch to crawl all the files on a web site,
> > not just the "root" url?
> >
>
> Hi,
>
> nutch acts as a crawler, much the same as when we use any Internet
> browser: neither nutch nor we can browse or crawl pages that don't have a
> referring page (a linked page).
> So you need a page that links to index1.html.
> The file index.html is crawled automatically since it should be your
> DirectoryIndex page.
>
> Otherwise you can still fetch it through "directory crawling" (instead of
> browser crawling), or you can disable the directory index page setting
> (e.g. Apache's DirectoryIndex) so that clients (nutch) can browse your
> directories.
>
> Thanks.-
>
>
> --
> wassalam,
> [bayu]
>

Re: Problem with regex url filter

Posted by Bayu Widyasanyata <bw...@gmail.com>.
On Mon, May 5, 2014 at 10:34 PM, Paul Rogers <pa...@gmail.com> wrote:

> My question is how do I get nutch to crawl all the files on a web site,
> not just the "root" url?
>

Hi,

nutch acts as a crawler, much the same as when we use any Internet
browser: neither nutch nor we can browse or crawl pages that don't have a
referring page (a linked page). So you need a page that links to
index1.html. The file index.html is crawled automatically since it should
be your DirectoryIndex page.

Otherwise you can still fetch it through "directory crawling" (instead of
browser crawling), or you can disable the directory index page setting
(e.g. Apache's DirectoryIndex) so that clients (nutch) can browse your
directories.
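
For reference, a minimal Apache httpd sketch of the directory-listing
side of this (the path is an assumption, and "DirectoryIndex disabled"
needs httpd 2.4 or later):

<Directory "/var/www/html/pcedocs">
    # stop serving index.html for the bare directory URL...
    DirectoryIndex disabled
    # ...and expose a browsable listing nutch can follow instead
    Options +Indexes
</Directory>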

Thanks.-


-- 
wassalam,
[bayu]

Re: Re: Problem with regex url filter

Posted by Paul Rogers <pa...@gmail.com>.
Hi

Thanks for the response.

My understanding of the outlink config is that those settings affect how
nutch deals with links embedded in individual crawled pages, rather than
with multiple pages/files in a single URL/directory.

Have I got that wrong?

P


On 5 May 2014 11:07, Tree ser <tr...@yahoo.com> wrote:

> Hi Paul,
>
> Maybe you need to check your nutch-site.xml settings. In nutch-default.xml,
> nutch fetches only one outlink from one page; you need to change the value
> from true to false. Then you can fetch them. Furthermore, you can set the
> outlink limit if you need.
>
> Sent from Yahoo Mail on iPad

Re: Problem with regex url filter

Posted by Tree ser <tr...@yahoo.com>.
Hi Paul,

Maybe you need to check your nutch-site.xml settings. In nutch-default.xml,
nutch fetches only one outlink from one page; you need to change the value
from true to false. Then you can fetch them. Furthermore, you can set the
outlink limit if you need.

Sent from Yahoo Mail on iPad
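
The property is not named above; a likely candidate in nutch-default.xml
is db.ignore.internal.links, a boolean that, when true, makes nutch drop
links between pages on the same host. Treat that name as an assumption
and verify it against your own nutch-default.xml. An override goes in
conf/nutch-site.xml, e.g.:

<?xml version="1.0"?>
<configuration>
  <!-- assumed property: verify the name and its default in
       nutch-default.xml before relying on this -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <!-- the "outlink limit" mentioned above -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>
</configuration>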