Posted to user@nutch.apache.org by Ali Nazemian <al...@gmail.com> on 2014/08/18 19:42:31 UTC

Different regex-urlfilter for different file types in nutch

Dear all,
Hi,
I use Nutch 1.8 to crawl some web sites. For this purpose I want to change
Nutch so that a different regex-urlfilter file is loaded for different
types of file, for example one for HTML files and another for image files
(jpg/jpeg, ...). Does Nutch already support this, or do I need to change
some code (probably in the regex-urlfilter plugin)?
Best regards.

-- 
A.Nazemian

Re: Different regex-urlfilter for different file types in nutch

Posted by atawfik <co...@gmail.com>.
Hi Ali,

I am not entirely sure, but I do not think you can determine the content
type before parsing. I think filtering is performed before parsing.

My suggestion is to implement a scoring or an indexing filter that returns
a null Nutch document based on the content type.
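
A minimal sketch of that idea, using plain Java stand-ins rather than the
real Nutch plugin classes. The class name, the Map used as a document, and
the allowed-type set are all assumptions made for illustration:

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an indexing filter; it does not implement
// the real Nutch plugin interface.
class ContentTypeIndexFilter {
    // Assumption: only these content types should reach the index.
    private final Set<String> allowedTypes;

    ContentTypeIndexFilter(Set<String> allowedTypes) {
        this.allowedTypes = allowedTypes;
    }

    // Returns the document unchanged if its content type is allowed,
    // or null to signal that the document should be dropped.
    Map<String, String> filter(Map<String, String> doc, String contentType) {
        if (contentType == null) {
            return null;
        }
        // Strip parameters such as "; charset=UTF-8" before comparing.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return allowedTypes.contains(base) ? doc : null;
    }
}
```

In a real plugin the content type would come from the parse or content
metadata, which is only available after fetching.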

Regards
Ameer


Re: Different regex-urlfilter for different file types in nutch

Posted by feng lu <am...@gmail.com>.
Yes, a URL filter cannot determine the file type the way a parser can. But I
think there are two ways to determine the type of a file. One is the suffix
of the URL, and the other is to send a HEAD request to get the Content-Type
of that resource, but this method takes longer than the first one.
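
Both methods could be sketched roughly as follows. This is only an
illustration in plain Java, not Nutch code: the class name, the suffix
table, and the timeouts are assumptions.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;
import java.util.Map;

class FileTypeGuesser {
    // Assumption: a small illustrative suffix-to-category table.
    private static final Map<String, String> SUFFIX_TO_TYPE = Map.of(
        "html", "html", "htm", "html",
        "jpg", "image", "jpeg", "image", "png", "image", "gif", "image");

    // Method 1: guess from the URL path suffix; no network access needed.
    // URI.create throws IllegalArgumentException for malformed URLs.
    static String fromSuffix(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "unknown";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // The dot must be in the last path segment and not the last character.
        if (dot <= slash || dot == path.length() - 1) {
            return "unknown";
        }
        String suffix = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUFFIX_TO_TYPE.getOrDefault(suffix, "unknown");
    }

    // Method 2: ask the server with a HEAD request. This costs one network
    // round trip per URL, which is why it is slower than method 1.
    static String fromHeadRequest(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeout, in milliseconds
            conn.setReadTimeout(5000);
            return conn.getContentType(); // may be null if the header is absent
        } finally {
            conn.disconnect();
        }
    }
}
```

The suffix method can be wrong when a server returns, say, an image from a
URL without an extension, so it is only a heuristic.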

But one thing confuses me: how do you decide which regex-urlfilter file
applies? I assume you store different regexes in different files. So before
you can pick the regexes to apply, you already have to know which URL belongs
to which regex-urlfilter file. :)  So that is my question.


On Mon, Aug 25, 2014 at 10:27 PM, Ali Nazemian <al...@gmail.com>
wrote:

> Hi,
> Do you have any idea how I can determine the file type in RegexUrlFilter?
> The file type is distinguishable at parse time, not at the URL filter
> extension point. For example, you can use a different parser for each
> MIME type in parse-plugins.xml. But how can I get the same behavior at
> the URL filter extension point?
>
> Best regards.
>
> --
> A.Nazemian
>



-- 
Don't Grow Old, Grow Up... :-)

Re: Different regex-urlfilter for different file types in nutch

Posted by Ali Nazemian <al...@gmail.com>.
Hi,
Do you have any idea how I can determine the file type in RegexUrlFilter?
The file type is distinguishable at parse time, not at the URL filter
extension point. For example, you can use a different parser for each
MIME type in parse-plugins.xml. But how can I get the same behavior at
the URL filter extension point?

Best regards.


On Tue, Aug 19, 2014 at 6:48 AM, feng lu <am...@gmail.com> wrote:

> Hi
>
> Do you want to apply different sets of rules to different types of files?
> As far as I can see, the regex-urlfilter plugin does not provide this
> feature, and neither do the other *-urlfilter plugins.
>
> Maybe you can add a method like
>
> protected Reader[] getRulesReaders(Configuration conf) throws IOException
>
> to the RegexURLFilterBase class to get one reader per configured rules file.
>
> --
> Don't Grow Old, Grow Up... :-)
>



-- 
A.Nazemian

Re: Different regex-urlfilter for different file types in nutch

Posted by feng lu <am...@gmail.com>.
Hi

Do you want to apply different sets of rules to different types of files?
As far as I can see, the regex-urlfilter plugin does not provide this
feature, and neither do the other *-urlfilter plugins.

Maybe you can add a method like

protected Reader[] getRulesReaders(Configuration conf) throws IOException

to the RegexURLFilterBase class to get one reader per configured rules file.
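
A rough, self-contained sketch of what such a method might look like. It
takes a list of file paths instead of a Hadoop Configuration so that it runs
without Nutch on the classpath; the class name and the readRules helper are
assumptions, not existing Nutch code. The rule syntax assumed here is the
usual one for regex-urlfilter.txt: '+' or '-' followed by a regex, with '#'
for comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class MultiFileRules {
    // Stand-in for the proposed getRulesReaders: opens one reader per file.
    // Assumption: paths are passed in directly rather than read from a
    // Hadoop Configuration.
    static Reader[] getRulesReaders(List<Path> ruleFiles) throws IOException {
        Reader[] readers = new Reader[ruleFiles.size()];
        for (int i = 0; i < readers.length; i++) {
            readers[i] = Files.newBufferedReader(ruleFiles.get(i), StandardCharsets.UTF_8);
        }
        return readers;
    }

    // Hypothetical helper: collects the rule lines from every reader,
    // skipping blank lines and '#' comments, and closes each reader
    // once it has been read.
    static List<String> readRules(Reader[] readers) throws IOException {
        List<String> rules = new ArrayList<>();
        for (Reader reader : readers) {
            try (BufferedReader in = new BufferedReader(reader)) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#")) {
                        continue;
                    }
                    char sign = line.charAt(0);
                    if (sign != '+' && sign != '-') {
                        throw new IOException("Invalid rule: " + line);
                    }
                    rules.add(line);
                }
            }
        }
        return rules;
    }
}
```

Note that this only loads several rule files; deciding which file applies to
which URL still needs some notion of file type at filter time.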


On Tue, Aug 19, 2014 at 1:42 AM, Ali Nazemian <al...@gmail.com> wrote:

> Dear all,
> Hi,
> I use Nutch 1.8 to crawl some web sites. For this purpose I want to change
> Nutch so that a different regex-urlfilter file is loaded for different
> types of file, for example one for HTML files and another for image files
> (jpg/jpeg, ...). Does Nutch already support this, or do I need to change
> some code (probably in the regex-urlfilter plugin)?
> Best regards.
>
> --
> A.Nazemian
>



-- 
Don't Grow Old, Grow Up... :-)