Posted to dev@nutch.apache.org by Shabanali Faghani <sh...@gmail.com> on 2016/05/16 20:37:40 UTC

A powerful Charset Encoding Detector plugin for Nutch

Hi all,

This is my first post in Nutch's developer mailing list.
A while ago, while working on a project, I developed a Java library to
detect the charset encoding of crawled HTML web pages. Before developing my
library I tested almost all charset detector tools, including two Apache
libraries, namely TikaEncodingDetector and Lucene-ICU4j, but none were
good for HTML documents.

Searching Google for the word "encoding" in Nutch's developer mailing list
archive
<https://www.google.com/#q=site:http:%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fnutch-dev%2F+encoding>,
I found some posts related to this problem, so I decided to propose my tool
here.

Library code on GitHub:
https://github.com/shabanali-faghani/IUST-HTMLCharDet
Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
Maven Central link:
http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0

I'm acquainted with the Nutch plugin policy
<https://wiki.apache.org/nutch/PluginCentral> and I know that some Nutch
plugins, such as LanguageIdentifier (a modified version of which we used in
our project 4 years ago), are very useful in practice. Also, I know that
EncodingDetectorPlugin
<https://wiki.apache.org/nutch/EncodingDetectorPlugin> is on Nutch's TODO
list and that it is a prerequisite for some new plugins, such as
NewLanguageIdentifier <https://wiki.apache.org/nutch/NewLanguageIdentifier>,
to be applicable in real life, as is stated here
<https://wiki.apache.org/nutch/LanguageIdentifier>.

In a nutshell, I can develop a layer on top of my library according to the
needs of Nutch described here
<https://issues.apache.org/jira/browse/NUTCH-25> (such as considering the
potential charset in the HTTP header) so that it can become the
EncodingDetectorPlugin of Nutch.
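
For illustration, a call to the library on the raw page bytes is all that
the layer would wrap. This little sketch assumes the detect(...) entry
point shown in my repository's README (please check the README for the
exact signature):

  import ir.ac.iust.htmlchardet.HTMLCharsetDetector;

  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class DetectExample {
    public static void main(String[] args) throws Exception {
      // read the raw, still-undecoded bytes of a fetched HTML page
      byte[] rawHtml = Files.readAllBytes(Paths.get(args[0]));
      // the boolean toggles the meta-tag lookup (see the README)
      String charset = HTMLCharsetDetector.detect(rawHtml, false);
      System.out.println("Detected charset: " + charset);
    }
  }

The plugin layer would then simply prefer a valid charset from the HTTP
header, if present, and fall back to this detection otherwise.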

Please let me know your thoughts.

Re: A powerful Charset Encoding Detector plugin for Nutch

Posted by Shabanali Faghani <sh...@gmail.com>.
Hi Sebastian,

Thanks for your reply, and sorry for my late reply, too :) ... due to the
recent 4-day holidays in our country.

I have worked with Apache's great stuff like ActiveMQ, Camel, ZooKeeper,
etc. in the past, and I also know Tika. We used Tika to parse PPT, DOC, XLS
and PDF documents in our project. Especially for the last case, i.e. PDF,
I've contributed a bug fix to Apache PDFBox. Also, I've used other
components used by or bundled with Tika, such as POI, Boilerpipe,
TikaEncodingDetector, etc., separately from Tika itself.

Hence, when suggesting my library I was in doubt:
(which is the correct place) ? Nutch : Tika;
But for a few reasons I felt that Nutch is the better place.

Anyway, I really thank you for your detailed response, and I'm eagerly
waiting to see your test results based on real-world test documents, though
my tests were done on real-world data too :) To extend the existing JUnit
tests, I can grant your GitHub user access to my repo. Having your tests in
my repo would help me improve it further.

By the way, since our project is fairly big (now ~1.2 billion pages), I've
done performance tests on my library. We can discuss the trade-off between
accuracy and performance, and possible ways to improve performance, in the
future, but for now I would just say that there is nothing to worry about.

Regards,
Shabanali

On Thu, May 19, 2016 at 2:01 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> [...]

Re: A powerful Charset Encoding Detector plugin for Nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Shabanali,

thanks for your offer! And sorry for the late reply.

Currently, charset detection in Nutch is not pluggable.
Because encoding is an integral part of document formats,
it's a task for the parser: it's really tied to the
document format, and works quite differently, e.g. for
HTML compared to PDF.

HTML charset detection is currently addressed
1) as part of the plugin parse-html, see [1] and [2]
2) in the plugin parse-tika, where it is not done in Nutch but "delegated"
   to the HTML parser of Apache Tika, see [3]

Tika also covers many other document formats, including formats (plain text)
where charset detection is more difficult.  Tika is a parsing library;
the main difference from a web crawler is that a crawler has extra context
from the web:
 - HTTP headers
 - the encoding specified in links (currently not used by Nutch):
    <a charset="ISO-8859-5" href="data/russian.txt">Russian</a>
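
Just to illustrate (a sketch, not Nutch code): both hints can be pulled
out with one small regular expression over the header value or the tag:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class CharsetHints {
    // matches charset=... in "text/html; charset=ISO-8859-5" as well as
    // in attributes like charset="ISO-8859-5"
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name found in the given text, or null. */
    public static String charsetHint(String text) {
      Matcher m = CHARSET.matcher(text);
      return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
      System.out.println(charsetHint("text/html; charset=ISO-8859-5"));
      System.out.println(charsetHint(
          "<a charset=\"ISO-8859-5\" href=\"data/russian.txt\">Russian</a>"));
    }
  }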

Possibly, Tika may be the better place to offer your work on improving the
encoding detection.

From experience I know that the character set detection of Nutch is not
always perfect: German documents are sometimes not correctly decoded.

I would generally agree with the research results in your cited paper:
- interpret the data as ISO-8859-1 (just bytes, not multi-byte sequences)
  > that's done in sniffCharacterEncoding(), see [1]
- strip HTML, including embedded CSS, scripts, etc.
  That's maybe a more reliable approach than increasing the size of the
  sniffed chunk (cf. [4]), but also more expensive in terms of computation.
- combining the 2 detector libraries (Mozilla and ICU) to get the best of
  both is, of course, a nice trick (see the sketch below), but again it may
  be too expensive in terms of the extra computation time.
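
As a rough illustration of that chaining (my sketch, not your library's
actual algorithm; it assumes the juniversalchardet (Mozilla) and ICU4J
jars on the classpath): trust the Mozilla detector first and fall back to
ICU when it gives no answer:

  import com.ibm.icu.text.CharsetDetector;
  import com.ibm.icu.text.CharsetMatch;

  import org.mozilla.universalchardet.UniversalDetector;

  public class CombinedDetector {

    public static String detect(byte[] data) {
      // 1) Mozilla detector: reliable on UTF-8 and multi-byte encodings,
      //    but returns null when it is not confident
      UniversalDetector mozilla = new UniversalDetector(null);
      mozilla.handleData(data, 0, data.length);
      mozilla.dataEnd();
      String charset = mozilla.getDetectedCharset();
      if (charset != null) {
        return charset;
      }
      // 2) ICU4J as fallback: always returns a best guess
      CharsetDetector icu = new CharsetDetector();
      icu.setText(data);
      CharsetMatch match = icu.detect();
      return match.getName();
    }
  }

Whether the second pass pays off would be part of the accuracy/performance
trade-off to measure.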


Ok, to wrap up: I'm sure you have a lot of good ideas for improvements.
And yes, help is always welcome. As Apache projects, Nutch and Tika rely
on the community to be involved in the development.

Instead of implementing a new charset detection plugin, a good approach and first
step could possibly be to test and evaluate the current state and provide real-world
test documents or even extend the existing JUnit tests.
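
Such a test could be as simple as the following hypothetical JUnit shape
(the file name and expected charset are made up; real fixtures would come
from crawled pages), here reusing the CombinedDetector sketch from above:

  import static org.junit.Assert.assertTrue;

  import java.nio.file.Files;
  import java.nio.file.Paths;

  import org.junit.Test;

  public class TestCharsetDetection {

    // hypothetical fixture: a German page known to be windows-1252 encoded
    @Test
    public void detectsGermanWindows1252Page() throws Exception {
      byte[] html = Files.readAllBytes(
          Paths.get("src/testresources/german-cp1252.html"));
      String detected = CombinedDetector.detect(html);
      assertTrue("windows-1252".equalsIgnoreCase(detected));
    }
  }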


Thanks,
Sebastian


[1] method sniffCharacterEncoding(byte[] content)

https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java#L81
[2] https://nutch.apache.org/apidocs/apidocs-1.11/org/apache/nutch/util/EncodingDetector.html
    https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/EncodingDetector.java
[3] http://tika.apache.org/
    https://tika.apache.org/1.13/api/org/apache/tika/parser/html/HtmlEncodingDetector.html
    https://tika.apache.org/1.13/api/org/apache/tika/detect/EncodingDetector.html
[4] https://issues.apache.org/jira/browse/NUTCH-2042







On 05/16/2016 10:37 PM, Shabanali Faghani wrote:
> [...]