Posted to user@nutch.apache.org by Alex Quezada <al...@deepwebtech.com> on 2006/09/18 23:09:52 UTC

how are filetypes mapped in Nutch

I've been trying to parse files that end in ps.gz or pdf.gz.  I would 
expect that parse-zip would handle them first and then, based on the 
new filetype, pass them on to parse-pdf.  However, from looking at the 
parse-zip source, it seems that it only attempts to extract text from 
the zipped file directly.
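
To make the expected behaviour concrete, here is a minimal sketch of 
"decompress first, then dispatch on the inner filetype".  It is not 
Nutch code: the class, the method names and the returned labels are 
all hypothetical, and only java.util.zip is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; none of these names exist in Nutch.
public class GunzipThenDispatch {

    // Decompress a gzip stream that is held in memory.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to build input for the demo in main().
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // "report.pdf.gz" -> "report.pdf"; other names are returned unchanged.
    static String innerName(String name) {
        return name.endsWith(".gz")
            ? name.substring(0, name.length() - ".gz".length())
            : name;
    }

    // Assumed mapping from the final extension to a handler label.
    static String parserFor(String name) {
        if (name.endsWith(".gz")) return "gunzip";
        if (name.endsWith(".pdf")) return "parse-pdf";
        return "none";
    }

    public static void main(String[] args) {
        String name = "report.pdf.gz";
        byte[] content = gzip("%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII));

        // Peel off the compression layer, then dispatch on what is left.
        while (parserFor(name).equals("gunzip")) {
            content = gunzip(content);
            name = innerName(name);
        }
        System.out.println(name + " -> " + parserFor(name)); // report.pdf -> parse-pdf
    }
}
```

The point of the sketch is only the ordering: the outer extension 
selects the decompression step, and the parser is chosen again from the 
name that remains.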

But what's really strange is that some files go (at least according to 
the hadoop log) to parse-zip, while others go directly to parse-pdf.  
Does anyone know where the filetype matching code is?  I'm wondering if 
the regex has a bug and it sometimes matches on the first part of the 
file type (i.e. ps instead of gz for 'ps.gz').
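
The kind of bug I have in mind can be sketched as follows.  This is a 
guess at a failure mode, not the actual Nutch matching code: both 
patterns and the class name are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; these patterns are not taken from Nutch.
public class ExtensionMatch {

    // Unanchored: find() stops at the first ".xxx" token in the name.
    private static final Pattern FIRST = Pattern.compile("\\.([A-Za-z0-9]+)");

    // Anchored with $: only the final extension can match.
    private static final Pattern LAST = Pattern.compile("\\.([A-Za-z0-9]+)$");

    static String firstExtension(String name) {
        Matcher m = FIRST.matcher(name);
        return m.find() ? m.group(1) : "";
    }

    static String lastExtension(String name) {
        Matcher m = LAST.matcher(name);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String name = "paper.ps.gz";
        System.out.println(firstExtension(name)); // ps
        System.out.println(lastExtension(name));  // gz
    }
}
```

If the matching were done with something like the unanchored pattern, 
'ps.gz' would be classified as ps, which would explain a file skipping 
the decompression step.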

Thanks,

Alex

Re: how are filetypes mapped in Nutch

Posted by Ernesto De Santis <de...@yahoo.com.ar>.
Hi Alex

I don't know, but I'm as interested in this point as you are.
I'm downloading the Nutch source code to find out by debugging it.

If you are researching this issue, we can share results.

I googled it and found nothing.

Bye,
Ernesto.




	
	
		