Posted to user@nutch.apache.org by Sol Lederman <so...@gmail.com> on 2017/11/08 14:55:00 UTC
different regex-urlfilter.txt files for different sets of URLs?
Hi,
I need to have different regex-urlfilter.txt files for different crawls.
Since the file lives in conf and I don't see a way to point nutch inject to
a different file or a different conf directory, I assume I should just swap
in a different regex-urlfilter.txt file every time I do a crawl.
Does that sound right?
Thanks.
Sol
Re: different regex-urlfilter.txt files for different sets of URLs?
Posted by Sol Lederman <so...@gmail.com>.
Hi Rushikesh,
I'm very new to Nutch. I'll let Sebastian and the other experts guide you.
I suspect that success in removing the header and footer will be very
dependent on the HTML files you're processing.
A quick Google search finds these pages:
http://grokbase.com/t/nutch/user/155ensey7k/parsing-pages-but-removing-headers-and-footers
http://grokbase.com/t/nutch/user/1563bdhv85/crawling-pages-but-ignoring-header-and-footer
http://lucene.472066.n3.nabble.com/Removing-Common-Web-Page-Header-and-Footer-from-content-td4168764.html
I suggest you start a new thread since I don't believe your question has
anything to do with this regex-urlfilter.txt discussion.
I also suggest that you try to implement what is suggested in those pages
and then write back (in a new discussion thread) what you did and what
isn't working.
Sol
Re: different regex-urlfilter.txt files for different sets of URLs?
Posted by Rushikesh K <ru...@gmail.com>.
Hi Sol,
I have a question: we are trying to use Nutch 1.3 to crawl our website, and we have a requirement to skip the page header and footer. I searched online but couldn't find an exact solution. Can you please guide us through that?
Thank you in advance!
--
Regards
Rushikesh M
.Net Developer
Re: different regex-urlfilter.txt files for different sets of URLs?
Posted by Sol Lederman <so...@gmail.com>.
Awesome! Thank you.
Re: different regex-urlfilter.txt files for different sets of URLs?
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Sol,
Of course, you could provide a separate package for every crawl.
In local mode, it's easier to point NUTCH_CONF_DIR to the right directory;
it can even be a ':'-separated hierarchy of folders to search for config
files (config files are actually looked up on the Java classpath).
For example, one could define a shell function for Nutch:
nutch () {
  NUTCH_LOG_DIR=./logs NUTCH_CONF_DIR=./conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/nutch "$@"
}
Every config file in ./conf/ is taken first (usually nutch-site.xml) before those
from $NUTCH_HOME/conf/.
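To illustrate that ':' search order (first directory containing the file wins), here is a small shell stand-in for the classpath lookup. This is a sketch for illustration only, not Nutch's actual resolution code, and the demo directory names are made up:

```shell
# Sketch: return the first directory in a ':'-separated list that contains
# the named file, mimicking how a per-crawl ./conf can shadow
# $NUTCH_HOME/conf on the classpath. Not Nutch's actual lookup code.
find_conf () {
  path_list=$1
  name=$2
  old_ifs=$IFS
  IFS=':'
  for dir in $path_list; do
    if [ -f "$dir/$name" ]; then
      IFS=$old_ifs
      printf '%s/%s\n' "$dir" "$name"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# Demo with temporary directories standing in for ./conf and $NUTCH_HOME/conf:
mkdir -p demo/local-conf demo/nutch-conf
echo "+." > demo/nutch-conf/regex-urlfilter.txt                  # default rules
echo "+^https?://example" > demo/local-conf/regex-urlfilter.txt  # per-crawl rules
find_conf "demo/local-conf:demo/nutch-conf" regex-urlfilter.txt
# → demo/local-conf/regex-urlfilter.txt
```

The same idea is why the wrapper function above lists ./conf before $NUTCH_HOME/conf: the per-crawl copy shadows the default one.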
For your specific use case, see also:
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
This would also work in cluster mode, since you can set or overwrite properties
from the command line when launching Nutch.
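As a sketch of that command-line override: the file name below (regex-urlfilter-news.txt) is made up, and this assumes your Nutch version's commands accept Hadoop-style -D generic options (Nutch tools are ToolRunner-based, but verify with your version). The named file must be findable on the classpath, e.g. inside the conf directory in use:

```shell
# Sketch: pick a per-crawl regex file via the urlfilter.regex.file property.
# "regex-urlfilter-news.txt" is a hypothetical name for illustration.
$NUTCH_HOME/bin/nutch inject \
  -D urlfilter.regex.file=regex-urlfilter-news.txt \
  crawl-news/crawldb seeds-news/
```

Alternatively, the same property can be set once in a per-crawl nutch-site.xml, so every command run with that conf directory picks it up.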
Sebastian