Posted to user@nutch.apache.org by Dmitriy Fundak <df...@gmail.com> on 2009/10/26 15:53:11 UTC

How to index files only with specific type

Hi, I've created a parser and an indexer for a specific file type (KML,
a geo XML metadata format).
I am trying to crawl a couple of sites and index only files of this type.
I don't want to index HTML or anything else.
How can I achieve this?
Thanks.

Re: How to index files only with specific type

Posted by Dmitriy Fundak <df...@gmail.com>.

Checking the URL suffix and returning null when it isn't the one I need
did the trick.
Thanks, Andrzej.
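
For anyone finding this thread later, a suffix check of this kind might look like the sketch below. It is only an illustration: the class and method names (UrlSuffixCheck, hasSuffix) are hypothetical, and it uses nothing but the JDK, so it is not the actual filter code. It compares the URL path only, so a query string or fragment does not hide the suffix.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch.
final class UrlSuffixCheck {

    /**
     * Returns true when the path of the URL ends with the given suffix,
     * ignoring case. The query string and fragment are not considered.
     * Unparseable URLs and URLs without a path yield false.
     */
    static boolean hasSuffix(String url, String suffix) {
        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            return false; // treat a malformed URL as "not the type I need"
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        return path.toLowerCase(Locale.ROOT)
                   .endsWith(suffix.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasSuffix("http://example.com/maps/route.kml", ".kml"));
        System.out.println(hasSuffix("http://example.com/index.html", ".kml"));
    }
}
```

Inside an indexing filter, the idea would be to return the document unchanged when the check succeeds and null otherwise.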

2009/10/27 Andrzej Bialecki <ab...@getopt.org>:
> Dmitriy Fundak wrote:
>>
>> If I disable the HTML parser (by removing "parse-(html" from the
>> plugin.includes property), HTML files don't get parsed,
>> so I don't get the outlinks to the KML files from the HTML pages,
>> and so I can't parse and index the KML files.
>> I might be wrong, but I have a feeling that it's not possible
>> without modifying the source code.
>
> It's possible to do this with a custom indexing filter - see other indexing
> filters to get a feeling of what's involved. Or you could do this with a
> scoring filter, too, although the scoring API looks more complicated.
>
> Either way, when you execute the Indexer, these filters are run in a chain,
> and if one of them returns null then that document is discarded, i.e. it's
> not added to the output index. So, it's easy to examine in your indexing
> filter the content type (or just a URL of the document) and either pass the
> document on or reject it by returning null.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: How to index files only with specific type

Posted by Andrzej Bialecki <ab...@getopt.org>.

Dmitriy Fundak wrote:
> If I disable the HTML parser (by removing "parse-(html" from the
> plugin.includes property), HTML files don't get parsed,
> so I don't get the outlinks to the KML files from the HTML pages,
> and so I can't parse and index the KML files.
> I might be wrong, but I have a feeling that it's not possible
> without modifying the source code.

It's possible to do this with a custom indexing filter - see other 
indexing filters to get a feeling of what's involved. Or you could do 
this with a scoring filter, too, although the scoring API looks more 
complicated.

Either way, when you execute the Indexer, these filters are run in a 
chain, and if one of them returns null then that document is discarded, 
i.e. it's not added to the output index. So, it's easy to examine in 
your indexing filter the content type (or just a URL of the document) 
and either pass the document on or reject it by returning null.
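
To illustrate the "return null to discard" contract described above, here is a minimal, self-contained sketch. The types in it (Doc, DocFilter, KmlOnlyFilter) are hypothetical stand-ins and deliberately NOT the Nutch API; a real indexing filter implements Nutch's own interface, whose exact signature should be taken from the existing indexing plugins. The KML MIME type used here is the registered one, but a given server may report something else, so treat that comparison as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins that only mirror the contract; not the Nutch API.
final class FilterChainSketch {

    /** Minimal document model: a URL plus the detected content type. */
    static final class Doc {
        final String url;
        final String contentType;

        Doc(String url, String contentType) {
            this.url = url;
            this.contentType = contentType;
        }
    }

    /** One link of the chain; returning null rejects the document. */
    interface DocFilter {
        Doc filter(Doc doc);
    }

    /** Passes KML documents through and rejects everything else. */
    static final class KmlOnlyFilter implements DocFilter {
        // Assumption: the server reports the registered KML MIME type.
        static final String KML_TYPE = "application/vnd.google-earth.kml+xml";

        @Override
        public Doc filter(Doc doc) {
            return KML_TYPE.equalsIgnoreCase(doc.contentType) ? doc : null;
        }
    }

    /** Runs the filters in order and stops at the first null. */
    static Doc runChain(List<DocFilter> filters, Doc doc) {
        Doc current = doc;
        for (DocFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // discarded: never reaches the index
            }
        }
        return current;
    }

    /** Collects only the documents that survive the whole chain. */
    static List<Doc> index(List<DocFilter> filters, List<Doc> parsed) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : parsed) {
            Doc kept = runChain(filters, d);
            if (kept != null) {
                out.add(kept);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<DocFilter> chain = Arrays.<DocFilter>asList(new KmlOnlyFilter());
        List<Doc> parsed = Arrays.asList(
            new Doc("http://example.com/index.html", "text/html"),
            new Doc("http://example.com/maps/route.kml", KmlOnlyFilter.KML_TYPE));
        for (Doc d : index(chain, parsed)) {
            System.out.println(d.url);
        }
    }
}
```

The point of filtering at this stage is that the HTML pages are still fetched and parsed, so their outlinks to the KML files are followed, and only the indexing step drops them.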

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: How to index files only with specific type

Posted by Dmitriy Fundak <df...@gmail.com>.

If I disable the HTML parser (by removing "parse-(html" from the
plugin.includes property), HTML files don't get parsed,
so I don't get the outlinks to the KML files from the HTML pages,
and so I can't parse and index the KML files.
I might be wrong, but I have a feeling that it's not possible
without modifying the source code.

thx

2009/10/26 BELLINI ADAM <mb...@msn.com>:
>
> Disable the HTML parser in nutch-site.xml and keep only your parser.
> You can also add this rule to your URL filter file: -(htm|html)$
>
> thx
>
>
>
>> Date: Mon, 26 Oct 2009 17:53:11 +0300
>> Subject: How to index files only with specific type
>> From: dfundak@gmail.com
>> To: nutch-user@lucene.apache.org
>>
>> Hi, I've created a parser and an indexer for a specific file type (KML,
>> a geo XML metadata format).
>> I am trying to crawl a couple of sites and index only files of this type.
>> I don't want to index HTML or anything else.
>> How can I achieve this?
>> Thanks.

RE: How to index files only with specific type

Posted by BELLINI ADAM <mb...@msn.com>.

Disable the HTML parser in nutch-site.xml and keep only your parser.
You can also add this rule to your URL filter file: -(htm|html)$
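
To show what a rule like that would exclude, here is a rough sketch of first-match-wins rule evaluation using only java.util.regex. The class and method names (UrlRuleSketch, Rule, accepts) are hypothetical, and the assumption that rules are applied as unanchored searches with the first matching rule deciding is mine, so check it against your Nutch version's URL filter plugin. The catch-all "+." rule is added only for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of sign-prefixed regex rules; not the Nutch plugin code.
final class UrlRuleSketch {

    /** One rule line: '+' or '-' followed by a regular expression. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            char sign = line.isEmpty() ? ' ' : line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            this.accept = sign == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(new Rule("-(htm|html)$"), new Rule("+."));
        for (String url : Arrays.asList(
                "http://example.com/index.html",
                "http://example.com/maps/route.kml",
                "http://example.com/")) {
            System.out.println(url + " accepted=" + accepts(rules, url));
        }
    }
}
```

Note that the rule only looks at the URL text: a page served as HTML from a URL that does not end in htm or html is not excluded by it.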

thx



> Date: Mon, 26 Oct 2009 17:53:11 +0300
> Subject: How to index files only with specific type
> From: dfundak@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Hi, I've created a parser and an indexer for a specific file type (KML,
> a geo XML metadata format).
> I am trying to crawl a couple of sites and index only files of this type.
> I don't want to index HTML or anything else.
> How can I achieve this?
> Thanks.
 		 	   		  