
Posted to user@nutch.apache.org by Eugen Kochuev <eu...@lan23.net> on 2006/05/25 20:22:04 UTC

content-type crawling problem

Hello,

Nutch is trying to crawl everything, including DLL, EXE and all other
non-textual formats. How can I limit Nutch to only the desired
content types? I know it's possible to do this by editing the urlfilter
plugin settings, but it's hard to anticipate every possible
extension, so this technique is unreliable.
Is it possible to make the crawler fetch only certain
content types, or at least have only those indexed?

Re[2]: content-type crawling problem

Posted by Eugen Kochuev <eu...@lan23.net>.

Thanks for sharing the information, I'll try this. But if I understand it
correctly, parse-plugins.xml contains rules for the parser only, so
undesirable documents will still be fetched and stored in the segments.
Is it possible to stop the fetcher from crawling these pages?

> Hello,

> I also had a similar problem; my little fix was to
> edit the parse-plugins.xml file. There is the rule:

> <mimeType name="*">
>    <plugin id="parse-text" />
> </mimeType>

> Just uncomment this wildcard match. You might also check
> the other rules for further unwanted content.

> I don't know if this is the best place for such a change,
> but it worked for me.

> with best regards,

> Heiko Dietze



-- 
Best regards,
 Eugen                            mailto:eugen@lan23.net

FieldQueryFilter vs RawFieldQueryFilter

Posted by Bogdan Kecman <bo...@alteray.com>.

Hi,
I'm writing some plugins for Nutch and some things are killing me.
Can someone explain the difference between a field and a raw field?

When I use Luke, all queries work like a charm, but they return 0 results
through the Nutch search.


Basically when should I have this as a query plugin:

-----
import org.apache.nutch.searcher.RawFieldQueryFilter;
public class HeadlineQueryFilter extends RawFieldQueryFilter {
	public HeadlineQueryFilter() {
		super("headline");
	}
}
------

And when:

-------
import org.apache.nutch.searcher.FieldQueryFilter;
public class HeadlineQueryFilter extends FieldQueryFilter {
  public HeadlineQueryFilter() {
    super("headline");
  }
}

-------
???

The indexing filter is:

----
    if (headline != null) {
      //doc.add(Field.Keyword("headline", headline));
      doc.add(new Field("headline", headline, Field.Store.YES, Field.Index.TOKENIZED));
      LOG.info("Headline added");
    } else {
      LOG.info("Headline not found");
    }
----

Thanks in advance
Bogdan
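
One possible explanation, which the thread itself does not confirm: the indexing filter above stores "headline" as a TOKENIZED field, so the index holds analyzed word terms rather than the whole headline. Assuming RawFieldQueryFilter looks the query text up as a single unanalyzed term, as its name suggests, while FieldQueryFilter is meant for tokenized fields, a raw lookup against a tokenized field could return nothing. The sketch below is only a toy model of that mismatch in plain Java. It does not use Lucene or Nutch, the "analyzer" is a simple lowercase-and-split assumption, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: shows how the indexing mode changes which terms exist.
// Hypothetical names; this is not the Lucene or Nutch API.
class TokenizedVsRaw {

    // Tokenized indexing (assumed analyzer): lowercase, split on non-word chars.
    static Set<String> tokenizedTerms(String value) {
        String[] words = value.toLowerCase(Locale.ROOT).split("\\W+");
        return new HashSet<>(Arrays.asList(words));
    }

    // Untokenized (keyword) indexing: the whole value is a single term.
    static Set<String> keywordTerms(String value) {
        return Set.of(value);
    }

    // Raw lookup: the query text must equal an indexed term exactly.
    static boolean rawMatch(Set<String> indexedTerms, String query) {
        return indexedTerms.contains(query);
    }

    // Analyzed lookup: the query is tokenized the same way as the field,
    // and every resulting token must be present in the index.
    static boolean analyzedMatch(Set<String> indexedTerms, String query) {
        return indexedTerms.containsAll(tokenizedTerms(query));
    }

    public static void main(String[] args) {
        String headline = "Nutch Released";
        Set<String> tokenized = tokenizedTerms(headline);
        Set<String> keyword = keywordTerms(headline);

        System.out.println("raw vs tokenized field:      " + rawMatch(tokenized, headline));
        System.out.println("analyzed vs tokenized field: " + analyzedMatch(tokenized, headline));
        System.out.println("raw vs keyword field:        " + rawMatch(keyword, headline));
    }
}
```

If this model matches what happens in the real plugin, the choice would come down to pairing the query filter with the way the field was indexed: a tokenized field with the analyzing filter, a keyword field with the raw one.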

Re: content-type crawling problem

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.

Heiko Dietze wrote:
> Hello,
> 
> Eugen Kochuev wrote:
>> Btw, do I need to uncomment this? It's more logical to comment this
>> out. Right?
>>
>>
>>> <mimeType name="*">
>>>   <plugin id="parse-text" />
>>> </mimeType>
>>
>>
>>> Just uncomment this wildcard match. You might also check
>>> the other rules for further unwanted content.
> 
> Sorry for the typo, I meant that you should leave it out, yes.
> 
> Unfortunately, this does not solve the fetching of the pages, but
> the index will be based only on the proper content. I think the index is
> created from the parsed content.

Maybe have a look at urlfilter-suffix and only fetch those files with
suffixes you want.


Regards,
 Stefan
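
As an illustration of the suffix idea suggested above, here is a minimal allow-list sketch in plain Java. It is not the urlfilter-suffix plugin's code or configuration format; the class name, the method name and the list of suffixes are assumptions made for the example. URLs without any extension are accepted here, since they are often ordinary pages, which also shows the limit raised earlier in the thread: a suffix says nothing certain about the content type actually served.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

// Hypothetical suffix allow-list; NOT the urlfilter-suffix plugin itself.
class SuffixAllowList {

    // Assumed list of wanted suffixes; adjust as needed.
    private static final List<String> ALLOWED = List.of(".html", ".htm", ".txt");

    // Returns true if the URL path has no extension or an allowed one.
    static boolean accept(String url) {
        // getPath() drops the query string and the fragment.
        String path = URI.create(url).getPath();
        if (path == null) {
            return false; // opaque URI such as mailto:, nothing to crawl
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1)
                                 .toLowerCase(Locale.ROOT);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0) {
            return true; // no extension: could be a directory or dynamic page
        }
        return ALLOWED.contains(lastSegment.substring(dot));
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/index.html"));
        System.out.println(accept("http://example.com/setup.EXE?x=1"));
        System.out.println(accept("http://example.com/docs/"));
    }
}
```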

Re: content-type crawling problem

Posted by Heiko Dietze <he...@biotec.tu-dresden.de>.

Hello,

Eugen Kochuev wrote:
> Btw, do I need to uncomment this? It's more logical to comment this
> out. Right?
> 
> 
>><mimeType name="*">
>>   <plugin id="parse-text" />
>></mimeType>
> 
> 
>>Just uncomment this wildcard match. You might also check
>>the other rules for further unwanted content.

Sorry for the typo, I meant that you should leave it out, yes.

Unfortunately, this does not solve the fetching of the pages, but
the index will be based only on the proper content. I think the index is
created from the parsed content.

with best regards,

Heiko Dietze
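
To make the idea of deciding by content type rather than by URL extension more concrete, here is a minimal sketch of such a check in plain Java. It is not part of Nutch and does not show where such a check would hook into the fetcher or indexer; the class name, method names and the allow-list are assumptions made for the example. It only shows how a Content-Type header value could be normalized and compared against a list of wanted types.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Nutch class: decides from a Content-Type
// header value whether a document is one of the wanted textual types.
class ContentTypeAllowList {

    // Assumed allow-list; adjust to the formats that should be indexed.
    private static final Set<String> ALLOWED = Set.of(
        "text/html", "text/plain", "application/xhtml+xml");

    // Strips parameters such as "; charset=UTF-8" and normalizes case.
    static String baseType(String header) {
        if (header == null) {
            return "";
        }
        int semicolon = header.indexOf(';');
        String base = semicolon >= 0 ? header.substring(0, semicolon) : header;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True only for types on the allow-list; unknown or missing types fail.
    static boolean isAllowed(String header) {
        return ALLOWED.contains(baseType(header));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("text/html; charset=UTF-8"));
        System.out.println(isAllowed("application/x-msdownload"));
    }
}
```

An allow-list is used here instead of a block-list because, as noted in the original question, it is hard to enumerate every unwanted format in advance.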

Re[2]: content-type crawling problem

Posted by Eugen Kochuev <eu...@lan23.net>.

Btw, do I need to uncomment this? It's more logical to comment this
out. Right?

> <mimeType name="*">
>    <plugin id="parse-text" />
> </mimeType>

> Just uncomment this wildcard match. You might also check
> the other rules for further unwanted content.



-- 
Best regards,
 Eugen                            mailto:eugen@lan23.net

Re: content-type crawling problem

Posted by Heiko Dietze <he...@biotec.tu-dresden.de>.

Hello,

I also had a similar problem; my little fix was to
edit the parse-plugins.xml file. There is the rule:

<mimeType name="*">
   <plugin id="parse-text" />
</mimeType>

Just uncomment this wildcard match. You might also check
the other rules for further unwanted content.

I don't know if this is the best place for such a change,
but it worked for me.

with best regards,

Heiko Dietze

Eugen Kochuev wrote:
> Any information on this? I really need to limit what Nutch indexes
> (only textual formats, excluding CSS, JavaScript and other data not
> meant for humans)
> 
> 
>>Nutch is trying to crawl everything, including DLL, EXE and all other
>>non-textual formats. How can I limit Nutch to only the desired
>>content types? I know it's possible to do this by editing the urlfilter
>>plugin settings, but it's hard to anticipate every possible
>>extension, so this technique is unreliable.
>>Is it possible to make the crawler fetch only certain
>>content types, or at least have only those indexed?
> 
>

Re: content-type crawling problem

Posted by Eugen Kochuev <eu...@lan23.net>.

Any information on this? I really need to limit what Nutch indexes
(only textual formats, excluding CSS, JavaScript and other data not
meant for humans)

> Nutch is trying to crawl everything, including DLL, EXE and all other
> non-textual formats. How can I limit Nutch to only the desired
> content types? I know it's possible to do this by editing the urlfilter
> plugin settings, but it's hard to anticipate every possible
> extension, so this technique is unreliable.
> Is it possible to make the crawler fetch only certain
> content types, or at least have only those indexed?