
Posted to user@nutch.apache.org by stone2dbone <an...@gmail.com> on 2013/07/25 14:49:52 UTC

Nutch returns index as document

When I perform a crawl, one of the documents Nutch returns is the directory
index (the listing of files in the directory), e.g.

for a crawl of:
https://my.domain.name/inside/test/

the content of the first document is:
Index of /inside/test Index of /inside/test Parent Directory test_css.css
test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
test_css4.html test_css5.cfm test_css6.cfm

How do I prevent this from happening?

regex-urlfilter.txt has the following:
# skip URLs
-^https://my.domain.name/inside/test$

# accept URLs
+^https://my.domain.name/inside/test/*




--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch returns index as document

Posted by Sebastian Nagel <wa...@googlemail.com>.

Seed URLs are filtered during inject.
A URL rejected by the filter rules does not get
into the CrawlDb and is not crawled.

You have to take care that seed URLs pass the filters.
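
To illustrate the point, here is a minimal sketch (not Nutch code) of what
filtering seeds at inject amounts to. It assumes the regex-urlfilter.txt
semantics as I understand them: the first rule whose pattern is found in the
URL decides, and a URL that matches no rule is rejected. The class and the
helpers passes() and inject() are hypothetical names made up for this example;
the rules are the corrected ones from elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SeedFilterSketch {

    /**
     * Hypothetical helper: returns true if the first rule that matches the
     * URL is a '+' rule. Assumption: first match wins, no match means reject.
     */
    static boolean passes(List<String> rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // no rule matched
    }

    /** Hypothetical helper: keeps only the seeds that pass the rules. */
    static List<String> inject(List<String> rules, List<String> seeds) {
        List<String> kept = new ArrayList<>();
        for (String seed : seeds) {
            if (passes(rules, seed)) {
                kept.add(seed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
            "-^https://my\\.domain\\.name/inside/test/?$",
            "+^https://my\\.domain\\.name/inside/test/.+");
        List<String> seeds = Arrays.asList(
            "https://my.domain.name/inside/test/",
            "https://my.domain.name/inside/test/test_css.html");
        // Only the second seed survives; the directory URL is dropped.
        System.out.println(inject(rules, seeds));
    }
}
```

In this sketch the directory URL never reaches the (simulated) CrawlDb, so
nothing would be fetched from it, while the page below it passes.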

On 08/02/2013 08:49 PM, stone2dbone wrote:
> Sebastian,
> 
> Can you please clarify what you mean?  Why can I not use
> https://my.domain.name/inside/test/ as a seed URL?
> 

Re: Nutch returns index as document

Posted by stone2dbone <an...@gmail.com>.

Sebastian,

Can you please clarify what you mean?  Why can I not use
https://my.domain.name/inside/test/ as a seed URL?



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323p4082258.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch returns index as document

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

Regexes must follow the Java regex syntax; see
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

I think your intention was:

# skip .../test and .../test/
-^https://my\.domain\.name/inside/test/?$
# allow paths below .../test/
+^https://my\.domain\.name/inside/test/.+

Finally, seeds are filtered too: with these rules you cannot use
  https://my.domain.name/inside/test/
as a seed URL.
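
A small sketch with plain java.util.regex shows how the original and the
suggested patterns behave on the URLs from this thread. It assumes the filter
uses "find" semantics (the pattern may match anywhere unless anchored); the
class name, the helper found(), and the myXdomain test URL are made up for
the example.

```java
import java.util.regex.Pattern;

class RegexCompare {

    /** Hypothetical helper: true if the regex is found somewhere in the URL. */
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String index = "https://my.domain.name/inside/test/";
        String page = "https://my.domain.name/inside/test/test_css.html";

        // Patterns from the original regex-urlfilter.txt
        String oldSkip = "^https://my.domain.name/inside/test$";
        String oldAccept = "^https://my.domain.name/inside/test/*";

        // Suggested patterns
        String newSkip = "^https://my\\.domain\\.name/inside/test/?$";
        String newAccept = "^https://my\\.domain\\.name/inside/test/.+";

        // The trailing slash keeps the old skip rule from matching the index.
        System.out.println(found(oldSkip, index));   // false
        // "/*" means zero or more slashes, so the index itself is accepted.
        System.out.println(found(oldAccept, index)); // true
        // Unescaped dots and "/*" also let unrelated URLs through.
        System.out.println(found(oldAccept,
            "https://myXdomain.name/inside/testing")); // true

        // "/?$" covers both .../test and .../test/
        System.out.println(found(newSkip, index));   // true
        System.out.println(found(newSkip,
            "https://my.domain.name/inside/test"));  // true
        // "/.+" requires at least one character after the slash.
        System.out.println(found(newAccept, page));  // true
        System.out.println(found(newAccept, index)); // false
    }
}
```

So with the original rules the index URL slips past the skip rule and is
picked up by the accept rule, which is why it shows up as a document.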

Sebastian


On 07/25/2013 02:49 PM, stone2dbone wrote:
> When I perform a crawl, one of the documents returned by Nutch is the index
> of documents. e.g.
> 
> for a crawl of:
> https://my.domain.name/inside/test/
> 
> the content of the first document is:
> Index of /inside/test Index of /inside/test Parent Directory test_css.css
> test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
> test_css4.html test_css5.cfm test_css6.cfm
> 
> How do I prevent this from happening?
> 
> regex-urlfilter.txt has the following:
> # skip URLs
> -^https://my.domain.name/inside/test$
> 
> # accept URLs
> +^https://my.domain.name/inside/test/*