Posted to user@nutch.apache.org by ytthet <ye...@gmail.com> on 2012/08/29 14:13:24 UTC
local file system crawl, unable to fetch file name containing CJK letter.
Hi Folks,
I am indexing a local file system using the protocol-file plugin. I have run
into an issue where the crawler is unable to fetch files whose names contain
CJK (non-English) characters, in my case Korean characters.
I have the following files in my target directory on the local file system.
file1.txt
file2.txt
filewithkorean가맹점정.txt
fileN.txt
When I crawl, the crawler fetches only file1.txt, file2.txt and fileN.txt,
but not filewithkorean가맹점정.txt.
I ran the parser checker command ./bin/nutch
org.apache.nutch.parse.ParserChecker file:///C:/targetdir to check the
outlinks extracted from the directory. The following is the result.
Title: Index of C:\targetdir
Outlinks: 2
outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor: filewithkorean??????.txt
outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012 08:47:32 GMT Content-Type=text/html
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
As shown above, the Korean characters are replaced by question marks
(??????) in the outlink. Thus, when the fetcher runs, it requests
/C:/targetdir/filewithkorean??????.txt instead of
/C:/targetdir/filewithkorean가맹점정.txt and hits a 404.
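For readers who see the same symptom: one common way for Korean characters to turn into '?' in Java is converting a String to bytes with a charset that cannot represent them. The sketch below only illustrates that mechanism. The class name and the choice of ISO-8859-1 as the lossy charset are assumptions made for the example, and it does not claim to be the code path Nutch actually takes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    // Encode a name to bytes with the given charset and decode it back,
    // mimicking a listing page that is written and read with that charset.
    static String roundTrip(String name, Charset cs) {
        return new String(name.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "file" followed by two Hangul syllables, written as escapes so the
        // example does not depend on the source file encoding.
        String name = "file\uAC00\uB9F9.txt";

        // ISO-8859-1 has no Hangul, so getBytes() substitutes '?' per character.
        System.out.println(roundTrip(name, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent every character, so the name survives intact.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));
    }
}
```

If the question marks already appear in the generated directory listing, changing the charset the parser uses afterwards cannot bring the original characters back, because the information is lost at the encoding step.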
My initial guess was that charset detection in the parser was the issue. I
tried setting different encodings such as windows-1252, utf-8, euc-kr and a
few others, but that does not seem to fix it.
Has anyone encountered a similar issue and fixed it before? I would
appreciate any suggestions.
Thanks,
Ye
--
View this message in context: http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: local file system crawl, unable to fetch file name containing CJK letter.
Posted by Ye T Thet <ye...@gmail.com>.
Hi Lewis,
I would be happy to do that. Let me dig up some docs on Nutch development. I
am completely new to open source projects.
Catch you folks on dev@nutch.
Cheers,
Ye
On Fri, Aug 31, 2012 at 2:27 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> Hi Ye,
>
> If you could contribute this to the community as a patch it would be
> greatly appreciated.
>
> If you need any help with this then please ping us on dev@nutch and we
> will be more than happy to help you out.
>
> Thank you in advance
>
> Lewis
>
Re: local file system crawl, unable to fetch file name containing CJK letter.
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Ye,
If you could contribute this to the community as a patch it would be
greatly appreciated.
If you need any help with this then please ping us on dev@nutch and we
will be more than happy to help you out.
Thank you in advance
Lewis
On Thu, Aug 30, 2012 at 2:14 PM, Ye T Thet <ye...@gmail.com> wrote:
> Hi Folks,
>
> I solved the issue. I am sharing it here in case others have a similar
> unsolved issue.
>
> It is due to a bug in FileResponse.java in the protocol-file plugin: UTF-8
> file names are not properly encoded. I changed some code in the constructor
> and in one private method called list2html. The change is a combination of
> the discussions on the following two JIRAs.
>
> https://issues.apache.org/jira/browse/NUTCH-824
> https://issues.apache.org/jira/browse/NUTCH-968
>
> It is important to change the code in both the constructor and the private
> method.
>
> Cheers,
>
> Ye
>
--
Lewis
Re: local file system crawl, unable to fetch file name containing CJK letter.
Posted by Ye T Thet <ye...@gmail.com>.
Hi Folks,
I solved the issue. I am sharing it here in case others have a similar
unsolved issue.
It is due to a bug in FileResponse.java in the protocol-file plugin: UTF-8
file names are not properly encoded. I changed some code in the constructor
and in one private method called list2html. The change is a combination of
the discussions on the following two JIRAs.
https://issues.apache.org/jira/browse/NUTCH-824
https://issues.apache.org/jira/browse/NUTCH-968
It is important to change the code in both the constructor and the private
method.
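The actual patch is not posted in this thread. As a rough illustration of the idea described above (encode the file name when the directory listing is generated, then decode it again when the resulting URL is fetched), here is a minimal sketch. The class and the helper names encodeName and decodePath are hypothetical and are not taken from FileResponse.java or from the JIRA discussions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FileNameEncoding {
    // Percent-encode one file name as UTF-8 for use in a listing link.
    // URLEncoder targets HTML forms and writes a space as '+', so '+' is
    // rewritten to %20 to keep the result usable inside a URL path.
    static String encodeName(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Reverse step: turn the percent-escapes back into the real file name.
    // Only meant for paths produced by encodeName(), where a literal '+'
    // has already been escaped as %2B.
    static String decodePath(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String name = "filewithkorean\uAC00.txt"; // one Hangul syllable
        String encoded = encodeName(name);
        System.out.println(encoded);
        System.out.println(decodePath(encoded).equals(name));
    }
}
```

The point of doing both steps is symmetry: if the listing is encoded but the fetch side does not decode (or the other way round), the path that reaches the file system still does not match the file on disk.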
Cheers,
Ye
On Wed, Aug 29, 2012 at 10:52 PM, hugo.ma <hu...@gmail.com> wrote:
> I had a similar problem. My solution was to modify the HttpResponse class
> in org.apache.nutch.protocol.httpclient.
>
> In the constructor I changed the first lines like this:
>
>     // Prepare GET method for HTTP request
>     this.url = url;
>     URI uri = null;
>     // MODIFIED: build a java.net.URI from the URL parts so that
>     // toASCIIString() can percent-encode non-ASCII characters
>     try {
>       uri = new URI(url.getProtocol(), url.getHost(), url.getPath(),
>           url.getQuery(), null);
>     } catch (Exception e) {
>       // do whatever you want
>     }
>
>     GetMethod get = new GetMethod(uri.toASCIIString());
>
>     // Continue with the original code
Re: local file system crawl, unable to fetch file name containing CJK letter.
Posted by "hugo.ma" <hu...@gmail.com>.
I had a similar problem. My solution was to modify the HttpResponse class in
org.apache.nutch.protocol.httpclient.
In the constructor I changed the first lines like this:
    // Prepare GET method for HTTP request
    this.url = url;
    URI uri = null;
    // MODIFIED: build a java.net.URI from the URL parts so that
    // toASCIIString() can percent-encode non-ASCII characters
    try {
      uri = new URI(url.getProtocol(), url.getHost(), url.getPath(),
          url.getQuery(), null);
    } catch (Exception e) {
      // do whatever you want
    }
    GetMethod get = new GetMethod(uri.toASCIIString());
    // Continue with the original code
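To try the same conversion outside of Nutch, the following standalone sketch wraps it in a small helper. The class name, the method name toAsciiUrl and the example.com URL are assumptions made for illustration. Unlike the snippet above, it passes url.getAuthority() rather than url.getHost(), so that an explicit port is kept.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class AsciiUrlDemo {
    // Convert a URL that may contain spaces or non-ASCII characters into a
    // pure-ASCII form with UTF-8 percent-escapes.
    static String toAsciiUrl(String spec) {
        try {
            URL url = new URL(spec);
            // The multi-argument constructor quotes illegal characters such
            // as spaces; non-ASCII letters are kept as-is at this stage.
            URI uri = new URI(url.getProtocol(), url.getAuthority(),
                    url.getPath(), url.getQuery(), null);
            // toASCIIString() then percent-encodes the non-ASCII characters.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // "\uAC00" is a single Hangul syllable.
        System.out.println(
            toAsciiUrl("http://example.com:8080/dir/\uAC00 b.txt"));
    }
}
```

One caveat: these URI constructors always quote the '%' character, so a path that already contains percent-escapes would be escaped a second time.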
--
View this message in context: http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999p4004059.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: local file system crawl, unable to fetch file name containing CJK letter.
Posted by Ye T Thet <ye...@gmail.com>.
Thanks Lewis,
My guess is that the issue is either with the encoding in the parser or with
the file protocol plugin.
I found this and tried it, but it does not work.
https://issues.apache.org/jira/browse/NUTCH-824
I am still digging around the source code to get it solved.
Regards,
Ye
On Wed, Aug 29, 2012 at 9:12 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> Please have a look at the discussion below
>
> http://www.mail-archive.com/user@nutch.apache.org/msg04176.html
>
> It should help you out.. or point you in the correct direction at least.
>
> hth
>
> Lewis
>
Re: local file system crawl, unable to fetch file name containing CJK letter.
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please have a look at the discussion below
http://www.mail-archive.com/user@nutch.apache.org/msg04176.html
It should help you out.. or point you in the correct direction at least.
hth
Lewis
--
Lewis