Posted to user@nutch.apache.org by stone2dbone <an...@gmail.com> on 2013/07/25 14:49:52 UTC
Nutch returns index as document
When I perform a crawl, one of the documents returned by Nutch is the
directory index itself. For example,
for a crawl of:
https://my.domain.name/inside/test/
the content of the first document is:
Index of /inside/test Index of /inside/test Parent Directory test_css.css
test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
test_css4.html test_css5.cfm test_css6.cfm
How do I prevent this from happening?
regex-urlfilter.txt has the following:
# skip URLs
-^https://my.domain.name/inside/test$
# accept URLs
+^https://my.domain.name/inside/test/*
--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch returns index as document
Posted by Sebastian Nagel <wa...@googlemail.com>.
Seed URLs are filtered during inject.
A URL rejected by the filter rules does not get
into the CrawlDb and is not crawled.
You have to make sure that seed URLs pass the filters.
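
To make the point about seeds concrete, here is a minimal sketch of what
"filtered during inject" means in practice. It is not Nutch's actual code:
the class SeedFilterSketch and its methods are hypothetical, and it assumes
the skip/accept rules suggested elsewhere in this thread, evaluated top to
bottom with the first matching rule deciding and unmatched URLs rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the URL filtering applied to seeds at inject time.
class SeedFilterSketch {
    // Rules in regex-urlfilter.txt style: leading '-' rejects, leading '+' accepts.
    static final String[] RULES = {
        "-^https://my\\.domain\\.name/inside/test/?$",
        "+^https://my\\.domain\\.name/inside/test/.+"
    };

    // Assumption: the first rule whose pattern is found in the URL decides;
    // a URL matched by no rule is rejected.
    static boolean passes(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    // Only seeds that pass the filters would end up in the CrawlDb.
    static List<String> injectable(List<String> seeds) {
        List<String> kept = new ArrayList<>();
        for (String seed : seeds) {
            if (passes(seed)) {
                kept.add(seed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> seeds = List.of(
            "https://my.domain.name/inside/test/",
            "https://my.domain.name/inside/test/test_css.html");
        // Prints [https://my.domain.name/inside/test/test_css.html]
        System.out.println(injectable(seeds));
    }
}
```

With rules like these, a seed list containing only the directory URL would
leave nothing to crawl, which is why the seeds have to be chosen so that they
pass the filters.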
On 08/02/2013 08:49 PM, stone2dbone wrote:
> Sebastian,
>
> Can you please clarify what you mean? Why can I not use
> https://my.domain.name/inside/test/ as a seed URL?
>
Re: Nutch returns index as document
Posted by stone2dbone <an...@gmail.com>.
Sebastian,
Can you please clarify what you mean? Why can I not use
https://my.domain.name/inside/test/ as a seed URL?
Re: Nutch returns index as document
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
Regexes must follow the Java regex syntax; see
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
I think your intention was:
# skip .../test and .../test/
-^https://my\.domain\.name/inside/test/?$
# allow paths below .../test/
+^https://my\.domain\.name/inside/test/.+
Finally, seeds are filtered as well: with these rules you cannot use
https://my.domain.name/inside/test/
as a seed URL.
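
The difference between the original rules and the ones above can be checked
with plain java.util.regex. The sketch below is only an illustration, not
Nutch's implementation: the class RegexRuleSketch is hypothetical, and it
assumes that rules are tried in order, that a rule applies when its pattern
is found anywhere in the URL, and that a URL matched by no rule is rejected.
In the original rules, "/*" means "zero or more slashes" and "test$" does not
match a URL ending in "test/", so the directory index is accepted.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing the two rule sets from this thread.
class RegexRuleSketch {
    // Rules as originally posted.
    static final String[] ORIGINAL = {
        "-^https://my.domain.name/inside/test$",
        "+^https://my.domain.name/inside/test/*"
    };

    // Rules as suggested in the reply.
    static final String[] CORRECTED = {
        "-^https://my\\.domain\\.name/inside/test/?$",
        "+^https://my\\.domain\\.name/inside/test/.+"
    };

    // Assumption: first rule whose pattern is found in the URL decides
    // ('+' accepts, '-' rejects); no matching rule means reject.
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String index = "https://my.domain.name/inside/test/";
        String page = "https://my.domain.name/inside/test/test_css.html";

        System.out.println(accepts(ORIGINAL, index));   // true: index page gets crawled
        System.out.println(accepts(ORIGINAL, page));    // true
        System.out.println(accepts(CORRECTED, index));  // false: index page is skipped
        System.out.println(accepts(CORRECTED, page));   // true
    }
}
```

Under the same assumptions, the original accept rule would also let through
a sibling path such as https://my.domain.name/inside/testing, because the
trailing "/*" is satisfied by zero slashes.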
Sebastian
On 07/25/2013 02:49 PM, stone2dbone wrote:
> When I perform a crawl, one of the documents returned by Nutch is the index
> of documents. e.g.
>
> for a crawl of:
> https://my.domain.name/inside/test/
>
> the content of the first document is:
> Index of /inside/test Index of /inside/test Parent Directory test_css.css
> test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
> test_css4.html test_css5.cfm test_css6.cfm
>
> How do I prevent this from happening?
>
> regex-urlfilter.txt has the following:
> # skip URLs
> -^https://my.domain.name/inside/test$
>
> # accept URLs
> +^https://my.domain.name/inside/test/*
>