Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/02/14 23:00:38 UTC

Detecting Encoding with plugins

Hi,

I can't see anywhere within our parser plugins where we detect the encoding
of documents. I've also begun looking through the o.a.n.p package, but again
I can't see anything.

Can anyone provide some detail on this please?

Thank you

Lewis



-- 
*Lewis*

Re: Detecting Encoding with plugins

Posted by Ken Krugler <kk...@transpac.com>.
On Feb 14, 2012, at 2:34pm, Lewis John Mcgibbney wrote:

> It's in HTMLParser, in the private static String method sniffCharacterEncoding.
> 
> I'm still wondering where TikaParser gets the character encoding from, though?

FYI, the individual Tika parsers have their own detection logic.

The HTML parser, for example, uses the response headers and metadata tags in addition to ICU's statistical method.
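
In sketch form, the explicit-clue part of that logic (a BOM, then a charset= attribute near the top of the document) looks something like the following. This is a simplified plain-Java illustration, not the actual Tika or Nutch code; the class and method names are invented:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffSketch {

    // Matches e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([-_A-Za-z0-9]+)", Pattern.CASE_INSENSITIVE);

    /** Sniff a charset from the leading bytes of an HTML page, or return null if no clue is found. */
    public static String sniff(byte[] content) {
        // 1. Byte-order marks are unambiguous clues.
        if (content.length >= 3 && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (content.length >= 2 && (content[0] & 0xFF) == 0xFE && (content[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (content.length >= 2 && (content[0] & 0xFF) == 0xFF && (content[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        // 2. Otherwise scan the first few KB for a charset= attribute, decoding
        //    as ISO-8859-1 so that every byte maps to some character.
        int len = Math.min(content.length, 8192);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page)); // prints: utf-8
    }
}
```

When neither a BOM nor an explicit charset= clue turns up, that is where the ICU-style statistical detection has to take over.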

That's something I'm still working on cleaning up, but haven't made much progress in the past few months.

-- Ken

> Additionally, this doesn't look like something we check for in our JUnit classes. If we don't, then I would like to write some tests for this.
> 
> I am working on Any23 tests first, so this provides the justification behind my question.
> 
> Thanks
> 
> Lewis

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Detecting Encoding with plugins

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Also, we fall back to the windows-1252 encoding (configured in the
parser.character.encoding.default property) when we can't find anything else.
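
The overall resolution order amounts to something like the following. This is a hypothetical sketch with invented class and method names, not the actual EncodingDetector code:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class EncodingFallbackSketch {

    // Mirrors the parser.character.encoding.default property.
    static final String DEFAULT_ENCODING = "windows-1252";

    /**
     * Pick the first usable clue: HTTP Content-Type header, then sniffed
     * meta tag, then statistical detection; fall back to the default.
     */
    public static String resolve(String headerClue, String metaClue, String detectedClue) {
        for (String clue : new String[] { headerClue, metaClue, detectedClue }) {
            if (clue == null) continue;
            try {
                if (Charset.isSupported(clue)) {
                    return Charset.forName(clue).name(); // canonical JVM name
                }
            } catch (IllegalCharsetNameException e) {
                // malformed clue; ignore it and try the next one
            }
        }
        return DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "utf-8", null)); // prints: UTF-8
        System.out.println(resolve(null, null, null));    // prints: windows-1252
    }
}
```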

On Tue, Feb 14, 2012 at 10:34 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> It's in HTMLParser, in the private static String method sniffCharacterEncoding.
>
> I'm still wondering where TikaParser gets the character encoding from,
> though? Additionally, this doesn't look like something we check for in our
> JUnit classes. If we don't, then I would like to write some tests for this.
>
> I am working on Any23 tests first, so this provides the justification
> behind my question.
>
> Thanks
>
> Lewis


-- 
*Lewis*

Re: Detecting Encoding with plugins

Posted by Lewis John Mcgibbney <le...@gmail.com>.
It's in HTMLParser, in the private static String method sniffCharacterEncoding.

I'm still wondering where TikaParser gets the character encoding from,
though? Additionally, this doesn't look like something we check for in our
JUnit classes. If we don't, then I would like to write some tests for this.

I am working on Any23 tests first, so this provides the justification
behind my question.

Thanks

Lewis



-- 
*Lewis*

Re: Detecting Encoding with plugins

Posted by Julien Nioche <li...@gmail.com>.
Hi Lewis

>> I assume Tika does already - why should we duplicate the tests in Nutch?
>
> We don't want to, I suppose. However, the point I was trying to make was
> that while NUTCH-1259 detects the encoding type, we don't have an automated
> test to cover this. I assume the case is somewhat important, or else the
> ticket for NUTCH-1259 wouldn't have been opened originally?
>

Nope. NUTCH-1259 is about storing the mime-type value detected by Tika, which
is not the same as the encoding. That JIRA is not about whether or not
we get the correct value; it is purely functional, about where we store
it. There is not much to test with respect to it.



> I agree with you that general cases should be dealt with further upstream
> within Tika development itself. However, as the encoding detection is done
> in Nutch within the crawl datum metadata, we may wish to get some test case
> to check... it's not a huge thing I suppose.
>

We do have tests for the EncodingDetector (TestEncodingDetector), which is
used by parse-html already. It is OK to have that, as parse-html is our own
parser. As explained earlier, for the Tika parser the detection is delegated
to the Tika parser implementations, and as such it should be tested there.
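
A test of that shape (feed the detector bytes, assert on the guessed charset) can be sketched standalone. The guess() method here is a deliberately naive stand-in based on UTF-8 validity, not Nutch's EncodingDetector:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class EncodingGuessTest {

    /** Toy detector: call it UTF-8 if the bytes decode cleanly, else assume windows-1252. */
    static String guess(byte[] content) {
        try {
            // The one-shot decode() reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(content));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "windows-1252";
        }
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8 decodes cleanly.
        if (!"UTF-8".equals(guess("café".getBytes(StandardCharsets.UTF_8)))) {
            throw new AssertionError("expected UTF-8");
        }
        // A lone 0xE9 byte ("é" in windows-1252) is malformed UTF-8.
        if (!"windows-1252".equals(guess(new byte[] { 'c', 'a', 'f', (byte) 0xE9 }))) {
            throw new AssertionError("expected windows-1252 fallback");
        }
        System.out.println("ok");
    }
}
```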


>
>> we delegate the functionality to Tika; IMHO this means delegating the
>> testing as well. What we could do is contribute tests to Tika instead, if it
>> does not have any.
>>
> Yeah this is correct. I'm expecting you guys will know better than me but
> I would assume that Tika is mimetype and encoding detection compliant ;0)
>

I definitely do not pretend to know more than anyone else, BTW :-) I don't
understand what you mean by 'compliant'. Perfect? Probably not. There was
an interesting experiment made by Ken on measuring the accuracy of the
charset detection in the Tika book - which anyone remotely interested in
Nutch should get, BTW. There has also been an interesting blog entry recently
comparing the language detection in Tika and other libraries (can't find the
ref and am in a hurry - sorry).


>
>
>> Re Any23: why not handle it as a Tika parser instead of a Nutch one?
>> This could be useful to other Tika users who do not necessarily use Nutch.
>>
> OK, so I suppose this is completely open for discussion, and I really
> welcome it as well. On one hand I see working with Any23 as a parse-any23
> plugin within Nutch as the first step on the road to answering this
> question. Regardless of whether Any23 graduates and is integrated into Tika
> itself or becomes a TLP, you are completely right that it should be made as
> openly available to as many people as possible. Personally I agree with you, Julien.
>
> One last thing, I know this is off topic... but with regards to our
> microformats-reltag plugin... I think the RelTagParser could and should be
> moved over to Any23. Any23 already supports extraction of a number of
> microformats. wdyt?
>

it would probably make sense as an initial step if you don't want to
venture in trying to wrap it as a Tika parser :-)

Julien



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Detecting Encoding with plugins

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,

On Wed, Feb 15, 2012 at 12:27 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> I assume Tika does already - why should we duplicate the tests in Nutch?

We don't want to, I suppose. However, the point I was trying to make was that
while NUTCH-1259 detects the encoding type, we don't have an automated
test to cover this. I assume the case is somewhat important, or else the
ticket for NUTCH-1259 wouldn't have been opened originally? I agree with
you that general cases should be dealt with further upstream within Tika
development itself. However, as the encoding detection is done in Nutch
within the crawl datum metadata, we may wish to get some test case to
check... it's not a huge thing I suppose.


> we delegate the functionality to Tika; IMHO this means delegating the
> testing as well. What we could do is contribute tests to Tika instead, if it
> does not have any.

Yeah, this is correct. I'm expecting you guys will know better than me, but
I would assume that Tika is mimetype and encoding detection compliant ;0)


> Re Any23: why not handle it as a Tika parser instead of a Nutch one?
> This could be useful to other Tika users who do not necessarily use Nutch.
>
OK, so I suppose this is completely open for discussion, and I really welcome
it as well. On one hand I see working with Any23 as a parse-any23 plugin
within Nutch as the first step on the road to answering this question.
Regardless of whether Any23 graduates and is integrated into Tika itself or
becomes a TLP, you are completely right that it should be made as openly
available to as many people as possible. Personally I agree with you, Julien.

One last thing, I know this is off topic... but with regards to our
microformats-reltag plugin... I think the RelTagParser could and should be
moved over to Any23. Any23 already supports extraction of a number of
microformats. wdyt?

Thanks

Re: Detecting Encoding with plugins

Posted by Julien Nioche <li...@gmail.com>.
I assume Tika does already - why should we duplicate the tests in Nutch? We
delegate the functionality to Tika; IMHO this means delegating the testing
as well. What we could do is contribute tests to Tika instead, if it does
not have any.

Re Any23: why not handle it as a Tika parser instead of a Nutch one?
This could be useful to other Tika users who do not necessarily use Nutch.

Julien

On 15 February 2012 12:17, Lewis John Mcgibbney
<le...@gmail.com> wrote:

> Yes this is correct, but we still don't test for either of the two.


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Detecting Encoding with plugins

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Yes this is correct, but we still don't test for either of the two.

On Wed, Feb 15, 2012 at 10:59 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> The mimetype is not the same thing as the encoding. As Ken pointed out,
> this is done at the individual parser level.


-- 
*Lewis*

Re: Detecting Encoding with plugins

Posted by Julien Nioche <li...@gmail.com>.
The mimetype is not the same thing as the encoding. As Ken pointed out, this
is done at the individual parser level.
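
To make the distinction concrete: the mime type stays text/html while the bytes differ per charset. A minimal plain-Java illustration, with an invented helper name:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MimeVsEncoding {

    /** How many more bytes the UTF-8 form of s needs than the windows-1252 form. */
    static int byteDelta(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length
             - s.getBytes(Charset.forName("windows-1252")).length;
    }

    public static void main(String[] args) {
        // Same text/html document, two different byte representations:
        // "é" is two bytes in UTF-8 but a single byte in windows-1252.
        System.out.println(byteDelta("<html><body>café</body></html>")); // prints: 1
    }
}
```

So knowing the mime type alone tells a parser nothing about which of those byte layouts it has been handed.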

On 14 February 2012 23:51, Markus Jelsma <ma...@apache.org> wrote:

> Hi,
>
> This was indeed an issue until today. The detected type is in the crawl
> datum
> metadata.
>
> https://issues.apache.org/jira/browse/NUTCH-1259



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Detecting Encoding with plugins

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,

I've been vaguely keeping up with your and Julien's work on this.

I would really like to get a test case for this, though! I'll try working
towards it as a sub-target of another issue. For reference, there is a
Tika mimeType test case here [1] and a Tika document encoding test here [2],
which we may or may not be interested in porting over to o.a.n.

wdyt?

Thanks

Lewis

[1]
https://svn.apache.org/viewvc/incubator/any23/trunk/core/src/test/java/org/apache/any23/mime/TikaMIMETypeDetectorTest.java?view=markup
[2]
https://svn.apache.org/viewvc/incubator/any23/trunk/core/src/test/java/org/apache/any23/encoding/TikaEncodingDetectorTest.java?view=markup

On Tue, Feb 14, 2012 at 11:51 PM, Markus Jelsma <ma...@apache.org> wrote:

> Hi,
>
> This was indeed an issue until today. The detected type is in the crawl
> datum
> metadata.
>
> https://issues.apache.org/jira/browse/NUTCH-1259



-- 
*Lewis*

Re: Detecting Encoding with plugins

Posted by Markus Jelsma <ma...@apache.org>.
Hi,

This was indeed an issue until today. The detected type is in the crawl datum 
metadata.

https://issues.apache.org/jira/browse/NUTCH-1259

> Hi,
> 
> I can't see anywhere within our parser plugins where we detect the encoding
> of documents. I've also begun looking through the o.a.n.p package, but again
> I can't see anything.
> 
> Can anyone provide some detail on this please?
> 
> Thank you
> 
> Lewis