Posted to dev@nutch.apache.org by Michela Becchi <mb...@nec-labs.com> on 2010/05/18 16:18:23 UTC
Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.
Hello,
I am crawling a local file system.
My problem is the following: all files with hexadecimal escape
sequences (such as %28) in their names fail to be crawled.
For example, I see the following error:
fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
I am using nutch-1.0.
Among other standard settings, I configured nutch-site.xml as follows:
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
Moreover, crawl-urlfilter.txt looks like:
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept everything else
+.*
---
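To check whether the filter rules above are responsible for the failure, the sketch below applies them to the failing URL. It assumes a first-match-wins evaluation in which a rule matches if its regular expression is found anywhere in the URL; whether the urlfilter-regex plugin evaluates rules in exactly this way is an assumption, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Rules copied from the crawl-urlfilter.txt listing above.
    // A leading '-' rejects, a leading '+' accepts.
    static final String[] RULES = {
        "-^(http|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
        "-[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+.*"
    };

    // Assumed semantics: the first rule whose pattern is found in the URL decides.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        System.out.println(url + " accepted: " + accepts(url));
    }
}
```

Under these assumptions the URL is accepted, because '%' is not in the rejected character set [?*!@=]. That would be consistent with the log, where the fetch is attempted and fails only at the protocol level.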
Thanks,
Michela
Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.
Posted by Michela Becchi <mb...@nec-labs.com>.
Hi,
I worked around this problem by modifying the
org.apache.nutch.protocol.file.FileResponse class, which belongs to the
protocol-file plugin.
In particular, at line 120, after the existing line
  String path = "".equals(url.getPath()) ? "/" : url.getPath();
I added
+ String decoded_path = path;
+ try {
+   decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
+ } catch (Exception ex) {
+   // decoding failed: fall back to the raw path
+ }
Then, rather than
- java.io.File f = new java.io.File(path);
I have
+ java.io.File f = new java.io.File(decoded_path);
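The workaround above can be tried outside Nutch with a small standalone sketch. The class and method names below are hypothetical and not part of the protocol-file plugin; only the decoding step mirrors the change described above.

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodedFilePath {
    // Hypothetical helper (not Nutch code): turns the path part of a
    // file: URL into a java.io.File, decoding percent escapes first.
    static File toLocalFile(String urlPath) {
        String path = "".equals(urlPath) ? "/" : urlPath;
        String decodedPath = path;
        try {
            decodedPath = URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException ex) {
            // Unknown charset or malformed escape: keep the raw path.
        }
        return new File(decodedPath);
    }

    public static void main(String[] args) {
        File f = toLocalFile("/tmp/A.M._%28album%29_8a09.html");
        System.out.println(f.getName());
    }
}
```

One caveat with this approach: URLDecoder implements HTML form decoding, so it also turns a literal '+' in the path into a space. File names containing '+' may therefore need different handling, for example decoding through java.net.URI instead.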
Thanks,
Michela
--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-File-Error-404-when-fetching-file-with-an-hexadecimal-character-in-the-file-name-tp826407p848871.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.
Posted by Michela Becchi <mb...@nec-labs.com>.
Hi Julien,
Thanks a lot.
I tried the same test you indicated ("bin/nutch plugin protocol-file
org.apache.nutch.protocol.file ...") and again got a 404 error. Of course,
I don't get this error if, when issuing the command, I replace the
hexadecimal representation with the literal character (e.g., "%28" with "(").
I opened an issue in JIRA, as you suggested.
Michela
Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.
Posted by Julien Nioche <li...@gmail.com>.
Hi Michela,
I tried the following command on a dummy file:
bin/nutch plugin protocol-file org.apache.nutch.protocol.file.File file:/tmp/A.M._%28album%29_8a09.html
and got the expected results:
Content-Type: text/html
Content-Length: 47067
Last-Modified: Tue, 18 May 2010 16:05:46 GMT
I assume that your local file is named A.M._(album)_8a09.html, in which
case we do indeed get a 404. Could you please describe the issue in JIRA?
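A small sketch of why the two file names behave differently, assuming the plugin hands the result of java.net.URL.getPath() to java.io.File unchanged (as the FileResponse line quoted elsewhere in this thread suggests). The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URL;

public class EncodedPathDemo {
    // java.net.URL.getPath() returns the path as written, escapes included.
    static String rawPath(String spec) throws Exception {
        URL url = URI.create(spec).toURL();
        return url.getPath();
    }

    // java.net.URI.getPath() returns the path with percent escapes decoded.
    static String decodedPath(String spec) {
        return URI.create(spec).getPath();
    }

    public static void main(String[] args) throws Exception {
        String spec = "file:/tmp/A.M._%28album%29_8a09.html";
        System.out.println("URL.getPath(): " + rawPath(spec));
        System.out.println("URI.getPath(): " + decodedPath(spec));
    }
}
```

With the raw path, a lookup only succeeds if the file on disk literally contains "%28" and "%29" in its name, which matches the dummy-file test above. A file named with real parentheses is found only through the decoded path.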
Thanks
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com