Posted to user@nutch.apache.org by oddaniel <od...@msn.com> on 2008/05/05 07:27:09 UTC
Someone Please respond ... Deleting Urls already crawled from the crawlDB
Guys, I have been trying to get this done for weeks now. No progress. Someone
please help me. I am trying to delete a domain already crawled from my
crawldb and index.
I have a list of domains already crawled in my index. How do I exclude or
delete domains from my crawl output folder? I have tried using
crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*
-^http://([a-z0-9]*?\.)*remita.net
hoping it would exclude the domain remita.net from the crawldb and index and
include all the other URLs. Then I ran the LinkDbMerger, SegmentMerger,
CrawlDbMerger, and IndexMerger. No change: all domains remain part of my output.
Please, how can I get this done?
--
View this message in context: http://www.nabble.com/Someone-Please-respond-...-Deleting-Urls-already-crawled-from-the-crawlDB-tp17053927p17053927.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Posted by wangkai <wa...@metarnet.com>.
It's my pleasure.
-----Original Message-----
From: oddaniel [mailto:oddaniel@msn.com]
Sent: May 5, 2008, 20:39
To: nutch-user@lucene.apache.org
Subject: Re: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Re: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Posted by oddaniel <od...@msn.com>.
Thanks. Problem solved.
--
View this message in context: http://www.nabble.com/Someone-Please-respond-...-Deleting-Urls-already-crawled-from-the-crawlDB-tp17053927p17060767.html
Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Posted by Howie Wang <ho...@hotmail.com>.
Order is important when defining rules in the urlfilter
files. A URL is accepted or rejected according to the
first pattern in the file that matches it.
> I have tried using the crawl-urlfilter.txt.
>
> +^http://([a-z0-9]*\.)*
> -^http://([a-z0-9]*?\.)*remita.net
I think you want
-^http://([a-z0-9]*?\.)*remita.net
+^http://([a-z0-9]*\.)*
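If it helps to see the first-match behavior concretely, here is a small Python sketch of the same semantics (this is an illustration, not Nutch's actual Java code; the regexes are the ones from this thread, with the dot in remita.net escaped):

```python
import re

# Nutch regex-urlfilter semantics: the first pattern that matches a URL
# decides its fate. '+' means accept, '-' means reject.
# Order matters: the reject rule must come before the catch-all accept.
RULES = [
    ("-", re.compile(r"^http://([a-z0-9]*?\.)*remita\.net")),  # reject remita.net first
    ("+", re.compile(r"^http://([a-z0-9]*\.)*")),              # then accept everything else
]

def filter_url(url):
    """Return the URL if the first matching rule is '+', else None."""
    for sign, pattern in RULES:
        if pattern.match(url):
            return url if sign == "+" else None
    return None  # no rule matched: rejected by default

print(filter_url("http://www.remita.net/page"))   # None: the '-' rule matches first
print(filter_url("http://www.example.com/page"))  # accepted by the '+' rule
```

With the rules in the original order (the '+' catch-all first), every URL, including remita.net, matches the accept rule before the reject rule is ever consulted, which is why nothing was filtered out.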
Howie
> From: wangk@metarnet.com
> To: nutch-user@lucene.apache.org
> Subject: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
> Date: Mon, 5 May 2008 14:12:20 +0800
Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Posted by wangkai <wa...@metarnet.com>.
Please try "CrawlDbMerger",
This tool merges several CrawlDb-s into one, optionally filtering URLs
through the current URLFilters, to skip prohibited pages.
It's possible to use this tool just for filtering - in that case only one
CrawlDb should be specified in arguments.
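For anyone following along, the filter-only pass described above can be pictured roughly like this Python sketch (this is just the idea, not the actual Java tool; in Nutch the tool is usually invoked through the bin script, something like `bin/nutch mergedb <output_crawldb> <crawldb> -filter` — check the usage message of your Nutch version):

```python
import re

# Rough sketch of a filter-only CrawlDb pass: copy each (url -> crawl
# datum) entry into a new db, dropping entries that the current URL
# filters reject. The reject pattern is the one from this thread.
REJECT = re.compile(r"^http://([a-z0-9]*?\.)*remita\.net")

def merge_with_filter(crawldb):
    """Return a new crawldb containing only URLs that pass the filter."""
    return {url: datum for url, datum in crawldb.items()
            if not REJECT.match(url)}

old_db = {
    "http://www.remita.net/a": {"status": "fetched"},
    "http://www.example.com/b": {"status": "fetched"},
}
new_db = merge_with_filter(old_db)
print(sorted(new_db))  # the remita.net entry is gone
```

The key point is that the pass rewrites the whole db, so URLs already stored in it are re-checked against the filters — unlike a fetch-time filter, which only affects URLs not yet crawled.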
-----Original Message-----
From: oddaniel [mailto:oddaniel@msn.com]
Sent: May 5, 2008, 13:27
To: nutch-user@lucene.apache.org
Subject: Someone Please respond ... Deleting Urls already crawled from the crawlDB