Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2017/11/25 22:08:51 UTC

Re: [MASSMAIL]General question on dealing with file types

Hi Sol.
Maybe you need to use suffix-urlfilter and also mimetype-filter.

The first will help you drop all URLs that end in .ico, .doc (and other suffixes).
The second is a very interesting indexing filter based on MIME types. Remember that the document has to be parsed before its MIME type can be identified reliably.
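
To illustrate the idea behind suffix-based filtering, here is a minimal Python sketch. It is not Nutch's implementation or its configuration format; the suffix list, the function name has_excluded_suffix, and the sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical suffix list for illustration only; not Nutch's configuration.
EXCLUDED_SUFFIXES = (".ico", ".doc")


def has_excluded_suffix(url: str) -> bool:
    """Return True when the URL path ends with one of the excluded suffixes."""
    # Compare against the path only, so a query string cannot hide the suffix.
    path = urlparse(url).path.lower()
    return path.endswith(EXCLUDED_SUFFIXES)


urls = [
    "http://foo.bar/favicon.ico",
    "http://foo.bar/report.doc",
    "http://foo.bar/index.html",
    "http://foo.bar",
]

# Keep every URL whose path does not end in an excluded suffix.
kept = [u for u in urls if not has_excluded_suffix(u)]
print(kept)
```

Note that the extensionless home page survives this kind of filter, which is why exclusion by suffix does not lose HTML pages that lack a file name.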





----- Original Message -----
From: "Sol Lederman" <so...@gmail.com>
To: user@nutch.apache.org
Sent: Saturday, November 25, 2017 10:56:42
Subject: [MASSMAIL]General question on dealing with file types

Like most of you I imagine, I want to capture and index file types from a
particular set of types. I want to index HTML but I may or may not want to
index cgi-bin or PDFs. It seems that there are two general approaches for
selecting what to include and exclude and neither seems ideal.

1. I can include files I care about based on the URL matching a reg ex. So,
I can have a list: html, HTML, pdf, PDF, etc. and filter out URLs that
don't match the pattern.

2. I can exclude files I don't want. I can exclude files with reg exes that
match /cgi-bin/, .ico, .doc, etc and keep everything else.

The problem with the first approach is that lots of HTML files don't end in
.html. Often there is no file name. The home page of a site may just be
http://foo.bar. So, the first approach will miss lots of HTML files.
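
A small Python sketch of that problem, assuming a hypothetical include pattern built from the extensions listed in approach 1 (the pattern and the helper name is_included are illustrative, not taken from any Nutch filter file):

```python
import re

# Hypothetical include pattern: accept only URLs ending in .html or .pdf,
# in any letter case.
INCLUDE_PATTERN = re.compile(r"\.(html|pdf)$", re.IGNORECASE)


def is_included(url: str) -> bool:
    """Return True when the URL ends with one of the wanted extensions."""
    return INCLUDE_PATTERN.search(url) is not None


for url in (
    "http://foo.bar/index.html",
    "http://foo.bar/paper.PDF",
    "http://foo.bar",        # home page, almost certainly HTML
    "http://foo.bar/docs/",  # directory-style URL, also likely HTML
):
    print(url, is_included(url))
```

The last two URLs are rejected even though they most likely serve HTML, because nothing in the URL string reveals the content type.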

The second approach is ok until I forget a file pattern that I really want
to exclude.

I'm wondering if using the MIME type in conjunction with the first approach
would work well. So, accept URLs with MIME type text/html, accept URLs that
match some URL patterns I want to include and exclude the rest.
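
The combined rule could be sketched roughly like this in Python. Everything here is an assumption for illustration: the accepted MIME set, the URL pattern, and the function should_index are hypothetical, and the MIME type is assumed to be known already (in practice only after the page has been fetched).

```python
import re
from typing import Optional

# Hypothetical settings for illustration only.
ACCEPTED_MIME_TYPES = {"text/html"}
INCLUDE_URL_PATTERN = re.compile(r"\.pdf$", re.IGNORECASE)


def should_index(url: str, mime_type: Optional[str]) -> bool:
    """Accept a document by MIME type first, then by URL pattern."""
    if mime_type is not None:
        # Drop parameters such as "; charset=UTF-8" before comparing.
        base_type = mime_type.split(";", 1)[0].strip().lower()
        if base_type in ACCEPTED_MIME_TYPES:
            return True
    # Fall back to the explicit URL patterns; everything else is excluded.
    return INCLUDE_URL_PATTERN.search(url) is not None


print(should_index("http://foo.bar", "text/html; charset=UTF-8"))
print(should_index("http://foo.bar/paper.pdf", "application/pdf"))
print(should_index("http://foo.bar/favicon.ico", "image/x-icon"))
```

With this ordering, an extensionless home page is accepted on its MIME type, a PDF is accepted on its URL, and anything matching neither rule is dropped without having to be listed.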

I can, I suppose, use approach #2 and not worry since files that don't have
text won't produce any searchable text in the index. I'm not too worried
about having some junk in the index as I'm not crawling a huge number of
pages.

Thoughts? What do folks generally do?

Thanks.

Sol