Posted to user@nutch.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2014/06/04 14:30:22 UTC

Crawling web and intranet files into single crawldb

Hi,

I am successfully running Nutch 1.8 and Solr 4.8.1 to fetch and index web
sources (http protocol).
Now I want to add file-share data sources (file protocol) to the current
crawldb.

What is the common strategy or practice for handling this situation?

Thank you.-

-- 
wassalam,
[bayu]

Re: Crawling web and intranet files into single crawldb

Posted by Bayu Widyasanyata <bw...@gmail.com>.
OK, thanks! :)


-- 
wassalam,
[bayu]

RE: Crawling web and intranet files into single crawldb

Posted by Markus Jelsma <ma...@openindex.io>.
Ah yes, I am wrong, do not remove it :) The prefix filter is a whitelist: it only passes URLs that start with one of the listed prefixes, so file:// has to stay in prefix-urlfilter.txt.
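For reference, a leading "-" in the URLFilterChecker output means the URL was rejected by the combined filter chain, and a "+" means it was accepted. Once the filters accept your seeds, the same check should print something like this (a sketch based on your seed list):

Checking combination of all URLFilters available
+http://www.myurl.com
+file://opt/searchengine/nutch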

 
 

Re: Crawling web and intranet files into single crawldb

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Markus,

Did you mean I should remove the "file://" line from prefix-urlfilter.txt?

When I checked with the command bin/nutch
org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt, it
returned:

Checking combination of all URLFilters available
-http://www.myurl.com
-file://opt/searchengine/nutch

What does it mean?

The following are the contents of my prefix-urlfilter.txt file (default
configuration):

http://
https://
ftp://
file://

Whether I remove "file://" or not, the result of the Nutch
URLFilterChecker stays the same.

-- 
wassalam,
[bayu]

RE: Crawling web and intranet files into single crawldb

Posted by Markus Jelsma <ma...@openindex.io>.
Remove it from the prefix filter and confirm it works using bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined.
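For example, you can feed a whole seed list to the checker on stdin (a sketch; it assumes you run this from the Nutch runtime directory and that urls/seed.txt holds your seeds):

bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt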


 

Re: Crawling web and intranet files into single crawldb

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Markus,

These are the files I should configure:

= prefix-urlfilter.txt: keep file://, which is already configured.
= regex-urlfilter.txt: update the line -^(file|ftp|mailto): to
-^(ftp|mailto): so file URLs are no longer excluded.
= urls/seed.txt: add the new URL/file path (see the sketch below).

...and start crawling.
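For example, the changed line in regex-urlfilter.txt would be:

-^(ftp|mailto):

and urls/seed.txt would gain an entry like this (the path is only a placeholder; file:/// with an empty host is the usual form for local paths):

file:///opt/searchengine/docs/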

Is it enough? CMIIW

Thanks-






-- 
wassalam,
[bayu]

RE: Crawling web and intranet files into single crawldb

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Bayu,

 
You must enable protocol-file first; it is not in the default plugin set. Then make sure the file:// prefix is not filtered out by prefix-urlfilter.txt or any other filter. Now just inject the new URLs and start the crawl.
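Something like this in conf/nutch-site.xml should do it (a sketch: start from the plugin.includes value in your nutch-default.xml and add protocol-file; the value below is abbreviated and may not match your setup exactly):

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|file)|urlfilter-(regex|prefix)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>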

 
Cheers


 