Posted to common-user@hadoop.apache.org by jibjoice <su...@hotmail.com> on 2008/01/09 03:07:31 UTC
nutch crawl and index problem
First, I set up conf/crawl-urlfilter like this:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
+.
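As a rough illustration of how a rule list like this behaves, here is a minimal Python sketch. It assumes the rules are tried in order, that a pattern may match anywhere in the URL, that the first matching rule decides, and that a URL matched by no rule is rejected. It also assumes Python's re module treats these patterns the same way as the regex engine Nutch uses. The function name accepts and the sample URLs other than the two hosts from this message are made up for the example.

```python
import re

# Rules copied from the filter above, in order: (sign, pattern).
# '+' accepts a URL, '-' rejects it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule matching anywhere in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption for this sketch: a URL that no rule matches is rejected.
    return False


# Hypothetical sample URLs, except for the two hosts named in this message.
SAMPLES = [
    "http://guide.kapook.com/",
    "http://www.kapook.com/",
    "http://www.kapook.com/index.php?id=1",
    "http://www.kapook.com/logo.png",
    "ftp://example.com/file.txt",
    "http://example.com/a/b/a/b/a/b/",
]

for url in SAMPLES:
    print("accept" if accepts(url) else "reject", url)
```

Under these assumptions both start URLs pass the filter, so the filter by itself may not be what stops the crawl of www.kapook.com. Any link containing one of the characters ?*!@= would be dropped, though.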
I can crawl "http://guide.kapook.com", but I can't crawl
"http://www.kapook.com". Some web pages can't be crawled completely, and I want to know why.
After the crawl, the index is not complete: it does not have a segments file. It has only:
/user/nutch/crawld/indexes/part-00000/_0.fdt <r 1> 365
/user/nutch/crawld/indexes/part-00000/_0.fdx <r 1> 8
/user/nutch/crawld/indexes/part-00000/_0.fnm <r 1> 66
/user/nutch/crawld/indexes/part-00000/_0.frq <r 1> 370
/user/nutch/crawld/indexes/part-00000/_0.nrm <r 1> 9
/user/nutch/crawld/indexes/part-00000/_0.prx <r 1> 611
/user/nutch/crawld/indexes/part-00000/_0.tii <r 1> 135
/user/nutch/crawld/indexes/part-00000/_0.tis <r 1> 10553
/user/nutch/crawld/indexes/part-00000/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00000/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00000/segments_2 <r 1> 41
/user/nutch/crawld/indexes/part-00001/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00001/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00001/segments_1 <r 1> 20
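To see at a glance which parts hold index data, the listing can be grouped by part directory. This is a minimal Python sketch, assuming each line has the form "path <r N> size" as in the listing above. Treating index.done and segments* as bookkeeping files rather than index data is an assumption made for this illustration, and data_files_by_part is a hypothetical helper name.

```python
from collections import defaultdict

# Listing lines copied from the message above: "<path> <r N> <size>".
LISTING = """\
/user/nutch/crawld/indexes/part-00000/_0.fdt <r 1> 365
/user/nutch/crawld/indexes/part-00000/_0.fdx <r 1> 8
/user/nutch/crawld/indexes/part-00000/_0.fnm <r 1> 66
/user/nutch/crawld/indexes/part-00000/_0.frq <r 1> 370
/user/nutch/crawld/indexes/part-00000/_0.nrm <r 1> 9
/user/nutch/crawld/indexes/part-00000/_0.prx <r 1> 611
/user/nutch/crawld/indexes/part-00000/_0.tii <r 1> 135
/user/nutch/crawld/indexes/part-00000/_0.tis <r 1> 10553
/user/nutch/crawld/indexes/part-00000/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00000/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00000/segments_2 <r 1> 41
/user/nutch/crawld/indexes/part-00001/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00001/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00001/segments_1 <r 1> 20
"""


def data_files_by_part(listing):
    """Map each part directory name to its files that are not bookkeeping files."""
    parts = defaultdict(list)
    for line in listing.splitlines():
        if not line.strip():
            continue
        path = line.split()[0]
        directory, name = path.rsplit("/", 1)
        part = directory.rsplit("/", 1)[1]
        files = parts[part]  # creates the entry even if the part has no data files
        # Assumption: index.done and segments* are bookkeeping, not index data.
        if name != "index.done" and not name.startswith("segments"):
            files.append(name)
    return dict(parts)


result = data_files_by_part(LISTING)
for part, files in sorted(result.items()):
    print(part, len(files), "data files")
```

For this listing the sketch reports data files only under part-00000. That is consistent with part-00001 having received no documents, although the listing alone does not prove it.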
How can I solve this?
--
View this message in context: http://www.nabble.com/nutch-crawl-and-index-problem-tp14703815p14703815.html
Sent from the Hadoop Users mailing list archive at Nabble.com.
Re: nutch crawl and index problem
Posted by jibjoice <su...@hotmail.com>.
I cannot solve it.
--
View this message in context: http://www.nabble.com/nutch-crawl-and-index-problem-tp14703815p14796643.html
Re: nutch crawl and index problem
Posted by jibjoice <su...@hotmail.com>.
Now I have:
/user/nutch/crawld/indexes/part-00000/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00000/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00000/segments_1 <r 1> 20
/user/nutch/crawld/indexes/part-00001/_0.fdt <r 1> 144
/user/nutch/crawld/indexes/part-00001/_0.fdx <r 1> 8
/user/nutch/crawld/indexes/part-00001/_0.fnm <r 1> 66
/user/nutch/crawld/indexes/part-00001/_0.frq <r 1> 31
/user/nutch/crawld/indexes/part-00001/_0.nrm <r 1> 9
/user/nutch/crawld/indexes/part-00001/_0.prx <r 1> 32
/user/nutch/crawld/indexes/part-00001/_0.tii <r 1> 31
/user/nutch/crawld/indexes/part-00001/_0.tis <r 1> 757
/user/nutch/crawld/indexes/part-00001/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00001/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00001/segments_2 <r 1> 41
It does not have the segments file that is important for Nutch search, so I ran the command
"bin/nutch merge /user/nutch/crawld/index /user/nutch/crawld/indexes". After
that, I listed /d01/local/crawld/index, and it has:
-rw-r--r-- 1 nutch users 144 Jan 10 16:24 _0.fdt
-rw-r--r-- 1 nutch users 8 Jan 10 16:24 _0.fdx
-rw-r--r-- 1 nutch users 66 Jan 10 16:24 _0.fnm
-rw-r--r-- 1 nutch users 31 Jan 10 16:24 _0.frq
-rw-r--r-- 1 nutch users 9 Jan 10 16:24 _0.nrm
-rw-r--r-- 1 nutch users 32 Jan 10 16:24 _0.prx
-rw-r--r-- 1 nutch users 31 Jan 10 16:24 _0.tii
-rw-r--r-- 1 nutch users 757 Jan 10 16:24 _0.tis
-rw-r--r-- 1 nutch users 41 Jan 10 16:24 segments_2
-rw-r--r-- 1 nutch users 20 Jan 10 16:24 segments.gen
This still does not have a segments file. Did I use "bin/nutch merge"
incorrectly? If so, how should I use this command?
--
View this message in context: http://www.nabble.com/nutch-crawl-and-index-problem-tp14703815p14730578.html