Posted to user@nutch.apache.org by kenneth man <ke...@scmedia.com.hk> on 2006/09/07 09:55:44 UTC

Charset question

Hi,

I want to do crawling on documents with charset="big5-hkscs" (which is an
extension of Big5, with extra Hong Kong Chinese characters).  But the
documents' meta tags set content="text/html; charset=big5" instead, so the
crawl engine treats the documents as "big5" instead of "big5-hkscs".  That
makes the extra Hong Kong characters unreadable on the search result page.
Now my question is: can I force the crawl engine to treat the documents as
"big5-hkscs"?

Thanks,
Kenneth Man

Re: extracting displayed data of body tag in HTML documents

Posted by Fadzi Ushewokunze <de...@butterflycluster.com>.
Hi Murat,

I think that happens already, but I might be wrong.

A quick scan led me to this class:
org.apache.nutch.parse.html.DOMContentUtils

Have a look around, and also consider looking at HTMLParser.java in Nutch.

I haven't looked at the NekoHTML source closely, but I hope this helps.

Fadzi




On Thu, 2006-11-30 at 18:07 +0200, Murat Ali Bayir wrote:
> Hi All, I want to ask a question about the NekoHTML parser that is used by
> Nutch. I want to know whether we can have a textExtraction function that
> extracts the displayed data between the <body> and </body> tags of HTML
> documents.
> 
> This textExtraction function would work as below:
> 
> case 1: Assume that our html document is given as:
> 
> <html>
> <body>
> 
> <a href="example.com"> this is an example </a>
> 
> </body>
> </html>
> 
> 
> For case 1, the textExtraction function returns the string "this is an
> example".
> 
> case 2: Assume that our html document is given as:
> 
> <html>
> <body>
> 
> <a href="example.com"> </a>
> 
> </body>
> </html>
> 
> 
> In this case (case 2), the textExtraction function returns null.
> 
> Does anybody know how to do that with the NekoHTML parser?
> 


extracting displayed data of body tag in HTML documents

Posted by Murat Ali Bayir <mu...@agmlab.com>.
Hi All, I want to ask a question about the NekoHTML parser that is used by
Nutch. I want to know whether we can have a textExtraction function that
extracts the displayed data between the <body> and </body> tags of HTML
documents.

This textExtraction function would work as below:

case 1: Assume that our html document is given as:

<html>
<body>

<a href="example.com"> this is an example </a>

</body>
</html>


For case 1, the textExtraction function returns the string "this is an
example".

case 2: Assume that our html document is given as:

<html>
<body>

<a href="example.com"> </a>

</body>
</html>


In this case (case 2), the textExtraction function returns null.

Does anybody know how to do that with the NekoHTML parser?
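For illustration, here is a sketch of such a textExtraction function. It uses only the JDK's built-in XML DOM parser, so the input must be well-formed XHTML; in Nutch itself, NekoHTML would produce the DOM fragment that similar traversal code walks. The class and method names are made up for this example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TextExtractionSketch {

    // Returns the displayed text between <body> and </body>, with
    // whitespace collapsed, or null when nothing is displayed (case 2).
    static String extractText(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            html.getBytes(StandardCharsets.UTF_8)));
            NodeList bodies = doc.getElementsByTagName("body");
            if (bodies.getLength() == 0) return null;
            // getTextContent() concatenates every descendant text node,
            // i.e. the "displayed data" of the body element.
            String text = bodies.item(0).getTextContent()
                    .replaceAll("\\s+", " ").trim();
            return text.isEmpty() ? null : text;
        } catch (Exception e) {
            return null; // malformed input: nothing extractable
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText(
            "<html><body><a href=\"example.com\"> this is an example </a></body></html>"));
        // case 1 prints: this is an example
    }
}
```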


RE: Charset question

Posted by King Kong <ch...@hotmail.com>.
to Ken:
     I run on SLES 10 and JRE 1.5, so GB18030 is supported all right.

     I found the cause of the problem: HtmlParser uses nekohtml to parse
the page, and nekohtml parses the page's meta tag to get the charset.
So when the page defines its own charset, the HtmlParser setEncoding has
no effect.
  
   Luckily, nekohtml provides a feature named
"http://cyberneko.org/html/features/scanner/ignore-specified-charset"
   to switch this behavior off.

   There is an introduction to it at
http://www.netlikon.de/docs/nekohtml-0.9.5/constant-values.html#org.cyberneko.html.HTMLScanner.IGNORE_SPECIFIED_CHARSET
    
   We can set that feature to true after creating the DOMFragmentParser:

   DOMFragmentParser parser = new DOMFragmentParser();
   try {
     parser.setFeature(
         "http://cyberneko.org/html/features/scanner/ignore-specified-charset",
         true);
     ....

   Then we can parse the page in the encoding we chose.
   
   BTW, in the parseNeko method of HtmlParser (in the parse-html plugin),
the following statements will throw an exception:

      parser.setFeature("http://apache.org/xml/features/include-comments",
              true);
      parser.setFeature("http://apache.org/xml/features/augmentations",
              true);

   so we must put our setFeature call before them:

      parser.setFeature(
          "http://cyberneko.org/html/features/scanner/ignore-specified-charset",
          true);
      parser.setFeature("http://apache.org/xml/features/include-comments",
              true);
      parser.setFeature("http://apache.org/xml/features/augmentations",
              true);

to Kenneth Man:
     I hope this message helps you solve your problem :-)

     Hmm... Are you Chinese? If so, you can PM me in Chinese.
     
-- 
View this message in context: http://www.nabble.com/Charset-question-tf2231717.html#a6359689
Sent from the Nutch - User forum at Nabble.com.


RE: Charset question

Posted by Ken Krugler <kk...@transpac.com>.
>I have met the same problem as Kenneth's.
>
>I crawl a page whose actual charset is GB18030, but in the meta of the
>page it is set to gb2312.
>
>So I got some unreadable characters when parsing it.
>
>I have fixed StringUtil.resolveEncodingAlias() following Ken's advice:
>
>  encodingAliases.put("GB2312", "GB18030");
>
>and I get the message "setting encoding to GB18030",
>
>but it still seems to have no effect; the result shows unreadable
>characters again. It seems the parser still uses the original gb2312
>encoding.

Another change I'd suggest making is to verify that
Charset.isSupported() returns true for a found alias, before
returning that name from resolveEncodingAlias().

From what I can tell, the current implementation is backwards: first it
should look up the alias, and then check whether that alias is
supported.

But a quick check on my system (JRE 1.5, Mac OS X 10.4.7) says that 
GB18030 is supported, so I'm guessing that's not your problem.
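For illustration, here is a sketch of the lookup-then-verify order described above. The encodingAliases map and the resolveEncodingAlias name mirror the snippets in this thread, but the rest is a hypothetical stand-alone version, not Nutch's actual StringUtil:

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class EncodingAliasSketch {
    // Hypothetical alias table; the big5 -> big5-hkscs and
    // gb2312 -> GB18030 entries are the ones discussed in this thread.
    private static final Map<String, String> encodingAliases = new HashMap<>();
    static {
        encodingAliases.put("big5", "big5-hkscs");
        encodingAliases.put("gb2312", "GB18030");
    }

    // Look up the alias first, then verify the JRE supports it;
    // fall back to the original name when that is supported instead.
    static String resolveEncodingAlias(String encoding) {
        String lower = encoding.toLowerCase(Locale.ROOT);
        String resolved = encodingAliases.getOrDefault(lower, encoding);
        if (Charset.isSupported(resolved)) return resolved;
        return Charset.isSupported(encoding) ? encoding : null;
    }

    public static void main(String[] args) {
        // Resolves to GB18030 on a JRE that ships that charset.
        System.out.println(resolveEncodingAlias("gb2312"));
    }
}
```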

-- Ken


>Ken Krugler wrote:
>>
>>>Thanks for your reply.
>>>
>>>I have found that the method you mentioned looks into the http header from
>>>web server.  It looks for "charset" and does the mapping.  The apache web
>>>server which contains the document has already  configured:
>>>
>>>AddDefaultCharset Big5-HKSCS
>>>
>>>The crawl engine does treat the encoding of all pages from the web
>>>server as Big5-HKSCS.
>>>But the crawl engine also looks into the meta tag of the html page.
>>>I have two identical html pages with Hong Kong Big5 characters. One
>>>has the tag
>>>
>>><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>>>
>>>The other
>>>
>>><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>>>
>>>When both of these html pages are in the search result page, the "summary"
>>>of the first one contains unreadable characters.
>>>So I think I need to modify the code which reads the meta tag of the
>>>html page.
>>>Do you have any idea?
>>
>>   From a quick look at the source, this eventually also calls
>>  StringUtil.resolveEncodingAlias().
>>
>>  HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(),
>>  passing it the content-type meta data, and then takes the returned
>>  charset name and calls StringUtil.resolveEncodingAlias().
>>
>>  So if you fix StringUtil.resolveEncodingAlias(), I think it will take
>>  care of both issues (HTTP server and HTML pages).
>>
>>  -- Ken
>>
>>
>>>-----Original Message-----
>>>>I want to do crawling on documents with charset="big5-hkscs" (which is
>>>>an extension of Big5, with extra Hong Kong Chinese characters).  But
>>>>the documents' meta tags set content="text/html; charset=big5"
>>>>instead, so the crawl engine treats the documents as "big5" instead of
>>>>"big5-hkscs".  That makes the extra Hong Kong characters unreadable on
>>>>the search result page.  Now my question is: can I force the crawl
>>>>engine to treat the documents as "big5-hkscs"?
>>>
>>>I don't know of a way to do this without some coding.
>>>
>>>You could modify the resolveEncodingAlias method to add (or
>>>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>>>rebuild Nutch.
>>>
>>>See the resolveEncodingAlias() method here:
>>>
>>>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/o
>>>rg/apache/nutch/util/StringUtil.java
>>
>>
>>  --
>>  Ken Krugler
>>  Krugle, Inc.
>>  +1 530-210-6378
>>  "Find Code, Find Answers"
>>
>>
>>
>
>--
>View this message in context: 
>http://www.nabble.com/Charset-question-tf2231717.html#a6353390
>Sent from the Nutch - User forum at Nabble.com.


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

RE: Charset question

Posted by King Kong <ch...@hotmail.com>.
I have met the same problem as Kenneth's.

I crawl a page whose actual charset is GB18030, but in the meta of the
page it is set to gb2312.

So I got some unreadable characters when parsing it.

I have fixed StringUtil.resolveEncodingAlias() following Ken's advice:

 encodingAliases.put("GB2312", "GB18030");

and I get the message "setting encoding to GB18030",

but it still seems to have no effect; the result shows unreadable
characters again. It seems the parser still uses the original gb2312
encoding.
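The symptom itself is easy to reproduce in plain Java, independent of Nutch: encode text with the real charset and decode it with the wrongly detected one (assuming the JRE ships both charsets; class and method names are made up):

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    // Encode with the real charset, decode with the wrongly-detected one.
    static String roundTrip(String s, String realCharset, String wrongCharset) {
        byte[] bytes = s.getBytes(Charset.forName(realCharset));
        return new String(bytes, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // U+20000 exists in GB18030 but not in GB2312, so decoding the
        // GB18030 bytes as GB2312 yields replacement characters.
        String original = new String(Character.toChars(0x20000));
        String garbled = roundTrip(original, "GB18030", "GB2312");
        System.out.println(original.equals(garbled));
    }
}
```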

Would you give me a hand ?

Thanks in advance.

King Kong


Ken Krugler wrote:
> 
>>Thanks for your reply.
>>
>>I have found that the method you mentioned looks into the http header from
>>web server.  It looks for "charset" and does the mapping.  The apache web
>>server which contains the document has already  configured:
>>
>>AddDefaultCharset Big5-HKSCS
>>
>>The crawl engine does treat the encoding of all pages from the web
>>server as Big5-HKSCS.
>>But the crawl engine also looks into the meta tag of the html page.
>>I have two identical html pages with Hong Kong Big5 characters. One has
>>the tag
>>
>><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>>
>>The other
>>
>><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>>
>>When both of these html pages are in the search result page, the "summary"
>>of the first one contains unreadable characters.
>>So I think I need to modify the code which reads the meta tag of the
>>html page.
>>Do you have any idea?
> 
>  From a quick look at the source, this eventually also calls 
> StringUtil.resolveEncodingAlias().
> 
> HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(), 
> passing it the content-type meta data, and then takes the returned 
> charset name and calls StringUtil.resolveEncodingAlias().
> 
> So if you fix StringUtil.resolveEncodingAlias(), I think it will take 
> care of both issues (HTTP server and HTML pages).
> 
> -- Ken
> 
> 
>>-----Original Message-----
>>>I want to do crawling on documents with charset="big5-hkscs" (which is
>>>an extension of Big5, with extra Hong Kong Chinese characters).  But
>>>the documents' meta tags set content="text/html; charset=big5" instead,
>>>so the crawl engine treats the documents as "big5" instead of
>>>"big5-hkscs".  That makes the extra Hong Kong characters unreadable on
>>>the search result page.  Now my question is: can I force the crawl
>>>engine to treat the documents as "big5-hkscs"?
>>
>>I don't know of a way to do this without some coding.
>>
>>You could modify the resolveEncodingAlias method to add (or
>>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>>rebuild Nutch.
>>
>>See the resolveEncodingAlias() method here:
>>
>>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/o
>>rg/apache/nutch/util/StringUtil.java
> 
> 
> -- 
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Charset-question-tf2231717.html#a6353390
Sent from the Nutch - User forum at Nabble.com.


Re: Charset question

Posted by Ken Krugler <kk...@transpac.com>.
>I want to do crawling on documents with charset="big5-hkscs" (which is an
>extension of Big5, with extra Hong Kong Chinese characters).  But the
>documents' meta tags set content="text/html; charset=big5" instead, so the
>crawl engine treats the documents as "big5" instead of "big5-hkscs".  That
>makes the extra Hong Kong characters unreadable on the search result page.
>Now my question is: can I force the crawl engine to treat the documents as
>"big5-hkscs"?

I don't know of a way to do this without some coding.

You could modify the resolveEncodingAlias method to add (or 
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to 
rebuild Nutch.

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
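Before rebuilding Nutch with that alias, it may also be worth confirming that the JRE in use can decode Big5-HKSCS at all, since charset availability varies between JREs. A tiny stand-alone check (class name made up for illustration):

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    // The big5 -> big5-hkscs alias only helps if the running JRE can
    // actually decode Big5-HKSCS, so check before rebuilding Nutch.
    static boolean supported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // illegal charset name
        }
    }

    public static void main(String[] args) {
        System.out.println("Big5-HKSCS supported: " + supported("Big5-HKSCS"));
        System.out.println("UTF-8 supported: " + supported("UTF-8"));
    }
}
```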

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"