
Posted to dev@nutch.apache.org by Michela Becchi <mb...@nec-labs.com> on 2010/05/18 16:18:23 UTC

Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

Hello,

 

I am crawling a local file system.

My problem is the following: files whose names contain characters that
appear in hexadecimal (percent-encoded) form in the URL, such as "%28",
do not get crawled.

 

For example, I see the following error:

 

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

 

I am using nutch-1.0.

 

Among other standard settings, I configured nutch-site.xml as follows:

 

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

 

Moreover, crawl-urlfilter.txt looks like this:

 

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*
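
A quick way to rule out the URL filter as the cause is to replay these rules against the failing URL. The sketch below is not Nutch code: the class name UrlFilterCheck and the method accepts are made up for illustration, and it assumes the behaviour described in the stock crawl-urlfilter.txt comments, namely that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is rejected.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {

    // Rules copied from the filter file above: '+' accepts, '-' rejects.
    static final String[] RULES = {
        "-^(http|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
        "-[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+.*"
    };

    // Assumption: first matching rule wins; no match means the URL is rejected.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        System.out.println(url + " accepted: " + accepts(url));
    }
}
```

Under these assumptions the failing URL falls through to the final "+.*" rule, since '%' is not among the characters rejected by "-[?*!@=]", so the filter is unlikely to be what drops the file.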


 

---

 

Thanks,

 

Michela

Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

Posted by Michela Becchi <mb...@nec-labs.com>.

Hi,

I circumvented this problem by modifying the
org.apache.nutch.protocol.file.FileResponse class belonging to the
protocol-file plugin.

In particular, at line 120, I added

String path = "".equals(url.getPath()) ? "/" : url.getPath();
+String decoded_path = path;
+try { 
+ decoded_path=java.net.URLDecoder.decode(path,"UTF-8");
+}catch(Exception ex){}

Then, rather than

- java.io.File f = new java.io.File(path);

I have

+ java.io.File f = new java.io.File(decoded_path);
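
For anyone applying the same workaround, here is a small standalone sketch of the decoding step, outside of Nutch. The class FilePathDecoding and its two helper methods are hypothetical names used only for illustration. It compares java.net.URLDecoder, as used in the patch above, with java.net.URI.getPath(). One difference worth knowing: URLDecoder performs form-style decoding, so it also turns a literal '+' into a space, whereas URI.getPath() only decodes percent-escapes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

public class FilePathDecoding {

    // Same call as in the workaround: form-style decoding ('+' becomes ' ').
    static String viaUrlDecoder(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    // Alternative: URI.getPath() decodes percent-escapes and leaves '+' alone.
    static String viaUri(String rawPath) {
        return URI.create(rawPath).getPath();
    }

    public static void main(String[] args) {
        String raw = "/tmp/A.M._%28album%29_8a09.html";
        System.out.println(viaUrlDecoder(raw)); // /tmp/A.M._(album)_8a09.html
        System.out.println(viaUri(raw));        // /tmp/A.M._(album)_8a09.html

        String withPlus = "/tmp/a+b%20c.html";
        System.out.println(viaUrlDecoder(withPlus)); // /tmp/a b c.html
        System.out.println(viaUri(withPlus));        // /tmp/a+b c.html
    }
}
```

For the file name in this thread both approaches give the same result; they only diverge for names that contain a literal '+'.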

Thanks,

Michela
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-File-Error-404-when-fetching-file-with-an-hexadecimal-character-in-the-file-name-tp826407p848871.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

Posted by Michela Becchi <mb...@nec-labs.com>.

Hi Julien,

Thanks a lot.

I ran the same test you suggested ("bin/nutch plugin protocol-file
org.apache.nutch.protocol.file ...") and again got an Error 404. Of course,
I don't get this error if, when issuing the command, I replace the
hexadecimal representation with the literal character (e.g., "%28" with "(").

I opened an issue in JIRA, as you suggested.

Michela

Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

Posted by Julien Nioche <li...@gmail.com>.

Hi Michela,

I tried the following command on a dummy file:

> bin/nutch plugin protocol-file org.apache.nutch.protocol.file.File file:/tmp/A.M._%28album%29_8a09.html

and got the expected results:

> Content-Type: text/html
> Content-Length: 47067
> Last-Modified: Tue, 18 May 2010 16:05:46 GMT

I assume that your local file is named *A.M._(album)_8a09.html*, in which
case we get a 404 indeed. Could you please describe the issue in JIRA?
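
The mismatch can be reproduced outside Nutch with a few lines of Java. This is only an illustrative sketch (the class EncodedNameLookup and its probe method are made-up names, not part of the protocol-file plugin): it creates a file whose name contains literal parentheses, then looks it up once under the percent-encoded name taken from the URL and once under the decoded name.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodedNameLookup {

    // Returns { found under the encoded name, found under the decoded name }.
    static boolean[] probe() {
        try {
            Path dir = Files.createTempDirectory("encoded-name-demo");
            Path onDisk = Files.createFile(dir.resolve("A.M._(album)_8a09.html"));

            String encodedName = "A.M._%28album%29_8a09.html";
            // URI.getPath() turns %28 and %29 back into '(' and ')'.
            String decodedName = URI.create(encodedName).getPath();

            boolean rawFound = new File(dir.toFile(), encodedName).exists();
            boolean decodedFound = new File(dir.toFile(), decodedName).exists();

            // Clean up the temporary file and directory.
            Files.delete(onDisk);
            Files.delete(dir);
            return new boolean[] { rawFound, decodedFound };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = probe();
        System.out.println("encoded name found: " + result[0]);
        System.out.println("decoded name found: " + result[1]);
    }
}
```

The lookup under the encoded name fails because the file system treats "%28" as three ordinary characters, which is consistent with the 404 reported for files that are stored with literal parentheses.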

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

