Posted to user@nutch.apache.org by feng lu <am...@gmail.com> on 2014/09/01 16:44:20 UTC

Re: Different regex-urlfilter for different file types in nutch

Yes, a URL filter cannot determine the file type the way a parser can. But I
think there are two ways to determine it. One is to look at the suffix of the
URL; the other is to send a HEAD request and read the Content-Type of the
resource, but that takes much longer than the first method.
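
As a rough illustration of the two approaches, here is a minimal Java sketch.
It is not Nutch code: the class name UrlTypeGuesser, the method names, the
category strings and the suffix list are all made up for this example, and
the 5-second timeouts are arbitrary.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class UrlTypeGuesser {

    /** Cheap guess from the URL path suffix; needs no network access. */
    static String guessBySuffix(String url) {
        String path = URI.create(url).getPath(); // query string is not part of the path
        if (path == null) {
            return "unknown";
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) {
            return "unknown"; // no suffix in the last path segment
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "jpg":
            case "jpeg":
            case "png":
            case "gif":
                return "image";
            case "html":
            case "htm":
                return "html";
            default:
                return "unknown";
        }
    }

    /**
     * Slower alternative: ask the server with a HEAD request.
     * Returns the Content-Type header, or null if the server sends none.
     */
    static String contentTypeViaHead(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // arbitrary timeout, in milliseconds
            conn.setReadTimeout(5000);
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

The suffix check is only string handling, while the HEAD variant costs one
network round trip per URL, which is why it is so much slower when applied to
every candidate URL.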

One thing still confuses me, though: how would you decide which
regex-urlfilter file applies? I assume you store the regexes for each file
type in a separate file. But then, before you can pick a file, you already
have to know which URL belongs to which regex-urlfilter file. :) So is that
the real question?
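
To make the question concrete, here is a small sketch of what per-type rule
sets could look like. It is self-contained Java and does not use the Nutch
plugin API; the class PerTypeRegexRules and its methods are hypothetical. The
rule syntax follows regex-urlfilter.txt ('+' or '-' followed by a regex, first
matching rule wins, no match means the URL is rejected). Note that the caller
has to supply the type before any rule can be applied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical class, not part of Nutch.
final class PerTypeRegexRules {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // One ordered rule list per file type, e.g. "html" or "image".
    private final Map<String, List<Rule>> rulesByType = new HashMap<>();

    /** Adds rule lines ("+regex" or "-regex") for the given type. */
    void addRules(String type, String... lines) {
        List<Rule> rules = rulesByType.computeIfAbsent(type, k -> new ArrayList<>());
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** First matching rule for the type decides; no match or unknown type rejects. */
    boolean accepts(String type, String url) {
        List<Rule> rules = rulesByType.get(type);
        if (rules == null) {
            return false;
        }
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }
}
```

For example, after addRules("image", "-/thumbs/", "+\\.(jpg|jpeg)$") a call to
accepts("image", url) keeps jpg/jpeg URLs outside /thumbs/, but only if
something has already decided that the URL is an image.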


On Mon, Aug 25, 2014 at 10:27 PM, Ali Nazemian <al...@gmail.com>
wrote:

> Hi,
> Do you have any idea how I can determine the file type in RegexURLFilter?
> The file type is distinguishable at parse time, not at the URL filter
> extension point. For example, you can use a different parser for each
> MIME type in parse-plugins.xml. But how can I get the same behavior at the
> URL filter extension point?
>
> Best regards.
>
>
> On Tue, Aug 19, 2014 at 6:48 AM, feng lu <am...@gmail.com> wrote:
>
> > Hi
> >
> > Do you want to apply different rules to different types of files? As far
> > as I can tell, the regex-urlfilter plugin does not provide this feature,
> > and neither do the other *-urlfilter plugins.
> >
> > Maybe you can add a method like
> >
> > protected Reader[] getRulesReaders(Configuration conf) throws IOException
> >
> > to the RegexURLFilterBase class, to get one reader per configured rules file.
> >
> >
> > On Tue, Aug 19, 2014 at 1:42 AM, Ali Nazemian <al...@gmail.com>
> > wrote:
> >
> > > Dear all,
> > > Hi,
> > > I use Nutch 1.8 to crawl some web sites. For this purpose I want to
> > > change Nutch so that a different regex-urlfilter file is loaded for
> > > each type of file, for example one for HTML files and another for image
> > > files (jpg/jpeg, ...). Does Nutch support this? Or should I change
> > > some lines of code (probably in the regex-urlfilter plugin)?
> > > Best regards.
> > >
> > > --
> > > A.Nazemian
> > >
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>
>
>
> --
> A.Nazemian
>



-- 
Don't Grow Old, Grow Up... :-)