Posted to user@nutch.apache.org by ytthet <ye...@gmail.com> on 2012/08/29 14:13:24 UTC

local file system crawl, unable to fetch file name containing CJK letter.

Hi Folks,

I am indexing a local file system using the protocol-file plugin. I have run
into an issue where the crawler is unable to fetch files whose names contain
CJK (non-English) characters; in my case, Korean characters.

I have the following files in my target directory on the local file system:

file1.txt
file2.txt
filewithkorean가맹점정.txt
fileN.txt

When I crawl, the crawler fetches only file1.txt, file2.txt and fileN.txt,
but not filewithkorean가맹점정.txt.

I ran the parser checker, ./bin/nutch
org.apache.nutch.parse.ParserChecker file:///C:/targetdir, to check the
outlinks extracted from the directory. The result is as follows:

Title: Index of C:\targetdir
Outlinks: 2
outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor: filewithkorean??????.txt
outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012 08:47:32 GMT Content-Type=text/html
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

As shown above, the Korean characters become ?????? in the outlink. So when
the fetcher runs, it tries to fetch /C:/targetdir/filewithkorean??????.txt
instead of /C:/targetdir/filewithkorean가맹점정.txt and gets a 404.
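
For readers who hit the same symptom: a literal ? in place of a non-ASCII
character is what Java produces when a string is converted to bytes in a
charset that cannot represent that character. The sketch below only
illustrates this general mechanism. Whether the directory listing goes through
a conversion like this is an assumption, and the class name is made up for the
example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class QuestionMarkDemo {
  // Encode a string to bytes in the given charset and decode it again.
  static String roundTrip(String s, Charset cs) {
    return new String(s.getBytes(cs), cs);
  }

  public static void main(String[] args) {
    // The four Hangul characters from the file name, written as Unicode
    // escapes so that the source file encoding does not matter.
    String korean = "\uAC00\uB9F9\uC810\uC815";
    // ISO-8859-1 cannot represent Hangul: each character becomes '?'.
    System.out.println(roundTrip(korean, StandardCharsets.ISO_8859_1));
    // UTF-8 can represent it, so the round trip is lossless.
    System.out.println(roundTrip(korean, StandardCharsets.UTF_8).equals(korean));
  }
}
```

Once the characters have been replaced this way, the original name cannot be
recovered from the URL, which would explain why changing the parser encoding
afterwards has no effect.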

My initial guess was that charset detection in the parser was the issue. I
tried setting different encodings, such as windows-1252, utf-8, euc-kr and a
few others, but that did not fix the issue.

Has anyone encountered a similar issue and fixed it before? I would
appreciate any suggestions.

Thanks,

--
View this message in context: http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: local file system crawl, unable to fetch file name containing CJK letter.

Posted by Ye T Thet <ye...@gmail.com>.

Hi Lewis,

I would be happy to do that. Let me dig up some docs on Nutch development
first; I am completely new to open source projects.

Catch you folks on dev@nutch.

Cheers,

Ye

On Fri, Aug 31, 2012 at 2:27 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Ye,
>
> If you could contribute this to the community as a patch it would be
> greatly appreciated.
>
> If you need any help with this then please ping us on dev@nutch and we
> will be more than happy to help you out.
>
> Thank you in advance
>
> Lewis

Re: local file system crawl, unable to fetch file name containing CJK letter.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Ye,

If you could contribute this to the community as a patch it would be
greatly appreciated.

If you need any help with this then please ping us on dev@nutch and we
will be more than happy to help you out.

Thank you in advance

Lewis

On Thu, Aug 30, 2012 at 2:14 PM, Ye T Thet <ye...@gmail.com> wrote:
> Hi Folks,
>
> I solved the issue. I am sharing the fix here in case others run into the
> same problem.
>
> It is due to a bug in FileResponse.java in the protocol-file plugin: file
> names containing non-ASCII (UTF-8) characters are not encoded properly. I
> changed some code in the constructor and in a private method called
> list2html. The change combines the discussions in the following two JIRA
> issues:
>
> https://issues.apache.org/jira/browse/NUTCH-824
> https://issues.apache.org/jira/browse/NUTCH-968
>
> It is important to change the code in both the constructor and the private
> method.
>
> Cheers,
>
> Ye

-- 
Lewis

Re: local file system crawl, unable to fetch file name containing CJK letter.

Posted by Ye T Thet <ye...@gmail.com>.

Hi Folks,

I solved the issue. I am sharing the fix here in case others run into the
same problem.

It is due to a bug in FileResponse.java in the protocol-file plugin: file
names containing non-ASCII (UTF-8) characters are not encoded properly. I
changed some code in the constructor and in a private method called
list2html. The change combines the discussions in the following two JIRA
issues:

https://issues.apache.org/jira/browse/NUTCH-824
https://issues.apache.org/jira/browse/NUTCH-968

It is important to change the code in both the constructor and the private
method.
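
The patch itself is not reproduced in this thread. What follows is only a
minimal sketch of the idea, assuming the fix amounts to percent-encoding file
names when the directory listing is generated and decoding them again before
the file is opened. The class and method names are hypothetical and are not
taken from FileResponse.java or from the JIRA patches.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not the actual FileResponse.java code.
class FileUrlRoundTrip {
  // Listing side: build a percent-encoded file: URL for one directory entry.
  static String toFileUrl(String dirPath, String fileName) throws URISyntaxException {
    // The multi-argument URI constructor quotes illegal characters, and
    // toASCIIString() percent-encodes non-ASCII characters as UTF-8.
    return new URI("file", null, dirPath + "/" + fileName, null).toASCIIString();
  }

  // Fetch side: recover the on-disk path from the encoded URL.
  static String toLocalPath(String fileUrl) throws URISyntaxException {
    // getPath() returns the path with percent-escapes decoded.
    return new URI(fileUrl).getPath();
  }

  public static void main(String[] args) throws URISyntaxException {
    // File name from the original report, Hangul written as Unicode escapes.
    String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";
    String url = toFileUrl("/C:/targetdir", name);
    System.out.println(url); // ASCII-only URL suitable for an outlink
    System.out.println(toLocalPath(url).equals("/C:/targetdir/" + name));
  }
}
```

The sketch shows why both sides need to change together: if only the listing
encodes the name, the fetch looks for a file whose name literally contains the
percent-escapes, and if only the fetch decodes, the listing still emits a name
that has already been damaged.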

Cheers,

Ye

On Wed, Aug 29, 2012 at 10:52 PM, hugo.ma <hu...@gmail.com> wrote:

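> I had a similar problem. My solution was to modify the HttpResponse class
> in org.apache.nutch.protocol.httpclient.
>
> In the constructor I changed the first lines like this:
>
>    // Prepare GET method for HTTP request
>    this.url = url;
>
>    // MODIFIED: rebuild the URL as a java.net.URI so that non-ASCII
>    // characters are percent-encoded (needs java.net.URI and
>    // java.net.URISyntaxException imported).
>    URI uri;
>    try {
>      // getAuthority() rather than getHost(), so that a port is not dropped
>      uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
>          url.getQuery(), null);
>    } catch (URISyntaxException e) {
>      // do whatever you want, but do not carry on with a null uri
>      throw new IllegalArgumentException("Cannot encode URL: " + url, e);
>    }
>
>    GetMethod get = new GetMethod(uri.toASCIIString());
>
>    // Continue with the original code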

Re: local file system crawl, unable to fetch file name containing CJK letter.

Posted by "hugo.ma" <hu...@gmail.com>.

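I had a similar problem. My solution was to modify the HttpResponse class in
org.apache.nutch.protocol.httpclient.

In the constructor I changed the first lines like this:

   // Prepare GET method for HTTP request
   this.url = url;

   // MODIFIED: rebuild the URL as a java.net.URI so that non-ASCII
   // characters are percent-encoded (needs java.net.URI and
   // java.net.URISyntaxException imported).
   URI uri;
   try {
     // getAuthority() rather than getHost(), so that a port is not dropped
     uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
         url.getQuery(), null);
   } catch (URISyntaxException e) {
     // do whatever you want, but do not carry on with a null uri
     throw new IllegalArgumentException("Cannot encode URL: " + url, e);
   }

   GetMethod get = new GetMethod(uri.toASCIIString());

   // Continue with the original code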
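
To see what this change does in isolation, the sketch below applies the same
java.net.URI calls to made-up URLs (example.com and the class name are
assumptions for illustration only). It passes url.getAuthority() as the second
argument, which keeps a port number if the URL has one. Note that the
multi-argument URI constructors always quote a '%' character, so a URL that is
already percent-encoded ends up encoded twice.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class; only standard java.net calls are used.
class UrlToAsciiDemo {
  // Rebuild a URL as a URI and return its ASCII-only form.
  static String toAscii(URL url) throws URISyntaxException {
    URI uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(),
        url.getQuery(), null);
    return uri.toASCIIString();
  }

  public static void main(String[] args) throws Exception {
    // Non-ASCII path: the Hangul character is percent-encoded as UTF-8.
    System.out.println(toAscii(new URL("http://example.com/file\uAC00.txt")));
    // Already-encoded path: the '%' itself is quoted again (%20 becomes %2520).
    System.out.println(toAscii(new URL("http://example.com/a%20b.txt")));
  }
}
```

If the URLs reaching the fetcher may already contain percent-escapes, this
approach needs a guard against the double encoding shown in the second call.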





--
View this message in context: http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999p4004059.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: local file system crawl, unable to fetch file name containing CJK letter.

Posted by Ye T Thet <ye...@gmail.com>.

Thanks Lewis,

My guess is that the issue is either with the encoding in the parser or with
the file protocol plugin.

I found the following issue and tried the fix from it, but it does not work:
https://issues.apache.org/jira/browse/NUTCH-824

I am still digging through the source code to get it solved.

Regards,

Ye

On Wed, Aug 29, 2012 at 9:12 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Please have a look at the discussion below
>
> http://www.mail-archive.com/user@nutch.apache.org/msg04176.html
>
> It should help you out, or at least point you in the right direction.
>
> hth
>
> Lewis

Re: local file system crawl, unable to fetch file name containing CJK letter.

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Please have a look at the discussion below

http://www.mail-archive.com/user@nutch.apache.org/msg04176.html

It should help you out, or at least point you in the right direction.

hth

Lewis

-- 
Lewis