Posted to user@nutch.apache.org by Sol Lederman <so...@gmail.com> on 2017/11/25 16:56:42 UTC

General question on dealing with file types

Like most of you, I imagine, I want to crawl and index only a particular set
of file types. I want to index HTML, but I may or may not want to
index cgi-bin or PDFs. It seems that there are two general approaches for
selecting what to include and exclude and neither seems ideal.

1. I can include files I care about based on the URL matching a regex. So,
I can have a list (html, HTML, pdf, PDF, etc.) and filter out URLs that
don't match the pattern.

2. I can exclude files I don't want. I can exclude URLs with regexes that
match /cgi-bin/, .ico, .doc, etc. and keep everything else.

The problem with the first approach is that lots of HTML files don't end in
.html. Often there is no file name. The home page of a site may just be
http://foo.bar. So, the first approach will miss lots of HTML files.

The second approach is ok until I forget a file pattern that I really want
to exclude.
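
To make the two failure modes concrete, here is a minimal sketch in plain Python. It is not Nutch configuration; the patterns, function names, and sample URLs are hypothetical and only illustrate the behavior described above.

```python
import re

# Approach 1: keep only URLs that end in a listed extension (hypothetical list).
INCLUDE = re.compile(r"\.(html?|pdf)$", re.IGNORECASE)

# Approach 2: drop URLs matching a known-unwanted pattern, keep everything else.
EXCLUDE = re.compile(r"/cgi-bin/|\.(ico|doc)$", re.IGNORECASE)


def keep_by_include(url: str) -> bool:
    """Approach 1: accept only if the URL matches the include pattern."""
    return INCLUDE.search(url) is not None


def keep_by_exclude(url: str) -> bool:
    """Approach 2: accept unless the URL matches the exclude pattern."""
    return EXCLUDE.search(url) is None


urls = [
    "http://foo.bar",                 # HTML home page with no file name
    "http://foo.bar/about.html",
    "http://foo.bar/cgi-bin/search",
    "http://foo.bar/logo.png",        # a suffix nobody remembered to exclude
]

for url in urls:
    print(url, keep_by_include(url), keep_by_exclude(url))
```

With these patterns, approach 1 rejects the home page http://foo.bar even though it is HTML, and approach 2 lets logo.png through because .png was never listed.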

I'm wondering if using the MIME type in conjunction with the first approach
would work well. So, accept URLs with MIME type text/html, accept URLs that
match some URL patterns I want to include and exclude the rest.
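
Roughly, the combined rule could look like the sketch below. This is plain Python, not a Nutch plugin; `accept`, `ACCEPTED_MIME_TYPES`, and `INCLUDE_URL` are hypothetical names, and the sketch assumes the MIME type is already known (or is None when it is not).

```python
import re
from typing import Optional

# Hypothetical whitelist of MIME types and an extra URL include pattern.
ACCEPTED_MIME_TYPES = {"text/html"}
INCLUDE_URL = re.compile(r"\.pdf$", re.IGNORECASE)


def accept(url: str, mime_type: Optional[str]) -> bool:
    """Accept on MIME type or on URL pattern; reject everything else."""
    if mime_type is not None:
        # Drop parameters such as "; charset=UTF-8" before comparing.
        base_type = mime_type.split(";", 1)[0].strip().lower()
        if base_type in ACCEPTED_MIME_TYPES:
            return True
    return INCLUDE_URL.search(url) is not None
```

For example, `accept("http://foo.bar", "text/html; charset=UTF-8")` is accepted on MIME type alone, while a URL ending in .ico with type image/x-icon matches neither rule and is rejected.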

I can, I suppose, use approach #2 and not worry since files that don't have
text won't produce any searchable text in the index. I'm not too worried
about having some junk in the index as I'm not crawling a huge number of
pages.

Thoughts? What do folks generally do?

Thanks.

Sol

RE: General question on dealing with file types

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Sol,

Note that you do not need a regular expression to filter by file suffix; the suffix-urlfilter plugin does that.
Obviously, if the URL does not contain the file type, you have to fetch the document anyway to get its MIME type. If there is no parser for this file type, it will not be parsed or indexed. If there is a parser and you want to disable it, I think you can do that in parse-plugins.xml (remove the * rule, and map only the MIME types you do want).
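
The ordering described above can be sketched in plain Python. This is only an illustration of the idea, not Nutch's API or the parse-plugins.xml format; every name here (`REJECTED_SUFFIXES`, `PARSERS`, `parse_html`, and so on) is hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical suffix list, checked with plain string matching (no regex).
REJECTED_SUFFIXES = (".ico", ".doc")


def passes_suffix_filter(url: str) -> bool:
    """Pre-fetch check: only the URL is known at this point."""
    return not url.lower().endswith(REJECTED_SUFFIXES)


def parse_html(content: bytes) -> str:
    """Stand-in parser; a real one would extract text from the markup."""
    return content.decode("utf-8", errors="replace")


# Only explicitly mapped MIME types get a parser; there is no "*" fallback.
PARSERS: Dict[str, Callable[[bytes], str]] = {"text/html": parse_html}


def parse_if_mapped(mime_type: str, content: bytes) -> Optional[str]:
    """Post-fetch step: return parsed text, or None if no parser is mapped."""
    parser = PARSERS.get(mime_type)
    return parser(content) if parser else None
```

A URL such as http://foo.bar passes the suffix check because it has no suffix to reject, so the MIME type only comes into play after the fetch, where an unmapped type like application/pdf simply yields no parsed text.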

	Yossi.


Re: General question on dealing with file types

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hi Sol.
You may want to use both suffix-urlfilter and mimetype-filter.

The first will help you filter out all URLs that have .ico, .doc (and others) as a suffix.
The second is a very interesting indexing filter based on MIME types. Remember that the document needs to be parsed for the MIME type to be identified reliably.