Posted to user@nutch.apache.org by Tomi NA <he...@gmail.com> on 2006/09/07 11:03:58 UTC

parse url and file attributes only - no content

I'd like the user to be able to find "my three dogs.jpg" if he
searches for "three dogs", even though Nutch doesn't have a .jpg
parser. What's more, I'd like the user to be able to search against any
other extrinsic file attribute: date, file size, even MIME type, all
without reading a single bit of the actual file contents.
Can Nutch be configured so that it indexes these external file
properties and completely skips the file contents?
I thought maybe I could adapt an existing parser (parse-text?) to do
the job, but I guess I'd still be stuck with reading megabytes of
unparsable data, just to fill in the url, type, date and similar
attributes. I'd appreciate a comment or two.

TIA,
t.n.a.
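
To make the request above concrete, here is a minimal sketch of gathering
extrinsic attributes for a local file without opening it. It is not Nutch
code: the class ExtrinsicAttributes, its methods and the field names are
hypothetical, chosen only for illustration. The MIME type is guessed from
the file name alone, so no content bytes are read.

```java
import java.io.IOException;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: collects searchable fields from a
// file's name and file system metadata only.
class ExtrinsicAttributes {

    // Split a file name into lower-case search terms on any run of
    // characters that are neither letters nor digits.
    static List<String> nameTerms(String fileName) {
        List<String> terms = new ArrayList<>();
        for (String t : fileName.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Build a field map from metadata. The file is never opened for reading.
    static Map<String, String> describe(Path file) throws IOException {
        String name = file.getFileName().toString();
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", name);
        fields.put("terms", String.join(" ", nameTerms(name)));
        fields.put("size", Long.toString(attrs.size()));
        fields.put("modified", attrs.lastModifiedTime().toString());

        // Guess the type from the name only; the result is null when the
        // extension is not in the JDK's table.
        String type = URLConnection.guessContentTypeFromName(name);
        fields.put("type", type == null ? "unknown" : type);
        return fields;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(describe(Path.of(arg)));
        }
    }
}
```

With this approach a query for "three dogs" could match the "terms" field
of "my three dogs.jpg". How such fields would be fed into a Nutch index is
left open here.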

Re: parse url and file attributes only - no content

Posted by Tomi NA <he...@gmail.com>.

On 9/7/06, heack <ko...@gmail.com> wrote:
> I have the same problem. It would help if there were a way to store a
> description for .mp3, .wmv or .avi files so that they could be searched.

I believe the problem can't be solved by adding a new parse plugin to
parse "all other (binary) filetypes": this additional parser would
still get the complete (possibly very big) file from the remote host.
At which level are the http.content.limit and file.content.limit taken
into account?
I'm thinking a new configuration setting (say,
(http|file).unsupported.extensions) set to "mp3|iso|psd" etc. could
guide the fetch algorithm so that it doesn't fetch the file contents
for these files, but simply fetches information *about* the files in
question. How does that sound?

t.n.a.
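
The proposal above could be sketched roughly as follows. This is not
existing Nutch code: the class MetadataOnlyPolicy and its methods are
hypothetical, and the "mp3|iso|psd" value simply follows the format
suggested in the message. For HTTP, one way to fetch information about a
file without its body is a HEAD request, which is what headSummary
assumes the server supports.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical sketch: decides from the URL's extension whether the fetcher
// should skip the body and collect metadata only.
class MetadataOnlyPolicy {

    private final Pattern extensions;

    // spec is a '|'-separated extension list such as "mp3|iso|psd".
    MetadataOnlyPolicy(String spec) {
        this.extensions = Pattern.compile("(?:" + spec + ")", Pattern.CASE_INSENSITIVE);
    }

    // True when the last path segment ends in one of the listed extensions.
    boolean isMetadataOnly(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return false;
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash || dot == path.length() - 1) {
            return false; // no extension in the last segment
        }
        return extensions.matcher(path.substring(dot + 1)).matches();
    }

    // Issue a HEAD request (http/https URLs only) and report type, length
    // and last-modified time from the response headers. No body is
    // transferred.
    static String headSummary(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return "type=" + conn.getContentType()
                    + " length=" + conn.getContentLengthLong()
                    + " modified=" + conn.getLastModified();
        } finally {
            conn.disconnect();
        }
    }
}
```

A fetcher could call isMetadataOnly(url) first and fall back to a normal
fetch when it returns false. Servers that omit Content-Length or
Last-Modified would yield -1 or 0 for those values.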

Re: parse url and file attributes only - no content

Posted by heack <ko...@gmail.com>.

I have the same problem. It would help if there were a way to store a
description for .mp3, .wmv or .avi files so that they could be searched.