Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2017/11/01 18:06:43 UTC

RE: Incorrect encoding detected

Any ideas?

Thanks!

 
 
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Tuesday 31st October 2017 13:14
> To: User <us...@nutch.apache.org>
> Subject: FW: Incorrect encoding detected
> 
> I actually don't know, can we specify a tika-config file in Nutch?
> 
> Thanks,
> Markus
>  
> -----Original message-----
> > From:Allison, Timothy B. <ta...@mitre.org>
> > Sent: Tuesday 31st October 2017 13:11
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > For 1.17, the simplest solution, I think, is to allow users to configure extending the detection limit via our @Field config methods, that is, via tika-config.xml.
> > 
> > To confirm, Nutch will allow users to specify a tika-config file?  Will this work for you and Nutch?
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> > Sent: Tuesday, October 31, 2017 5:47 AM
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hello Timothy - what would be your preferred solution? Increase detection limit or skip inline styles and possibly other useless head information?
> > 
> > Thanks,
> > Markus
> > 
> >  
> >  
> > -----Original message-----
> > > From:Markus Jelsma <ma...@openindex.io>
> > > Sent: Friday 27th October 2017 15:37
> > > To: user@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hi Tim,
> > > 
> > > I have opened TIKA-2485 to track the problem. 
> > > 
> > > Thank you very very much!
> > > Markus
> > > 
> > >  
> > >  
> > > -----Original message-----
> > > > From:Allison, Timothy B. <ta...@mitre.org>
> > > > Sent: Friday 27th October 2017 15:33
> > > > To: user@tika.apache.org
> > > > Subject: RE: Incorrect encoding detected
> > > > 
> > > > Unfortunately there is no way to do this now.  _I think_ we could make this configurable, though, fairly easily.  Please open a ticket.
> > > > 
> > > > The next RC for PDFBox might be out next week, and we'll try to release Tika 1.17 shortly after that...so there should be time to get this in.
> > > > 
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> > > > Sent: Friday, October 27, 2017 9:12 AM
> > > > To: user@tika.apache.org
> > > > Subject: RE: Incorrect encoding detected
> > > > 
> > > > Hello Tim,
> > > > 
> > > > > Getting rid of script and style contents sounds plausible indeed. But to work around the problem for now, can I instruct HTMLEncodingDetector from within Nutch to look beyond the limit?
> > > > 
> > > > Thanks!
> > > > Markus
> > > > 
> > > >  
> > > >  
> > > > -----Original message-----
> > > > > From:Allison, Timothy B. <ta...@mitre.org>
> > > > > Sent: Friday 27th October 2017 14:53
> > > > > To: user@tika.apache.org
> > > > > Subject: RE: Incorrect encoding detected
> > > > > 
> > > > > Hi Markus,
> > > > >   
> > > > > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are what is actually being used for encoding detection.  The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
> > > > >  
> > > > > At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding[0]...perhaps it is time to resurrect that code and integrate it?
> > > > > 
> > > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
> > > > > 
> > > > > Cheers,
> > > > > 
> > > > >                Tim
> > > > > 
> > > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > > > >    
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> > > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > > To: user@tika.apache.org
> > > > > Subject: Incorrect encoding detected
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > > > 
> > > > > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the regular HTML parser does a fine job, but our TikaParser has a tough job dealing with this HTML. For some reason Tika thinks Content-Encoding=windows-1252 is what this webpage says it is, whereas the page identifies itself properly as UTF-8.
> > > > > 
> > > > > Of all websites we index, this is so far the only one giving trouble indexing accents, getting fÃ¥ instead of a regular få.
> > > > > 
> > > > > Any tips to spare? 
> > > > > 
> > > > > Many many thanks!
> > > > > Markus
> > > > > 
> > > > 
> > > 
> > 
>

Re: Incorrect encoding detected

Posted by Sebastian Nagel <wa...@googlemail.com>.

I haven't had time to dig into the problem:
neither how to pass a tika-config file, nor why
parse-html actually detects the encoding correctly
although it also only looks at the first 8192
characters (see CHUNK_SIZE).
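
To make the effect of such a limit concrete, here is a minimal Python sketch. It is not Tika or Nutch code: the function name sniff_meta_charset, the simplified regular expression, and the sample page are all assumptions made for illustration, and CHUNK_SIZE simply mirrors the 8192 figure mentioned above. It shows how a charset declaration that sits behind a large inline script is missed when only the first chunk is inspected.

```python
import re

CHUNK_SIZE = 8192  # mirrors the limit mentioned in this thread (assumption)

# Simplified pattern for illustration; real detectors handle more <meta> variants.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_meta_charset(content: bytes, limit: int = CHUNK_SIZE):
    """Return the charset declared within the first `limit` bytes, or None."""
    match = META_CHARSET.search(content[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical page: a large inline script precedes the <meta> declaration.
page = (
    b"<html><head><script>" + b"x" * 32000 + b"</script>"
    b'<meta charset="utf-8"></head><body></body></html>'
)

print(sniff_meta_charset(page))               # None: declaration is past the limit
print(sniff_meta_charset(page, limit=65536))  # utf-8
```

Raising the limit is the cheaper of the two options discussed in the thread; stripping script and style contents before sniffing would avoid scanning the filler at all.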

Just one point: for the MIME detection we also
pass the Content-Type sent by the web server to Tika.
Could it also help to pass it as an additional clue
for the encoding? In this concrete example the server sends
  Content-Type: text/html; charset=utf-8
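
As a rough illustration of using the header as a hint (a Python sketch with the standard library only, not how Nutch or Tika actually do it; the helper name charset_from_content_type is hypothetical), the charset parameter can be pulled from the header value and used for decoding, with windows-1252 only as a fallback:

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


raw = "få".encode("utf-8")  # the bytes the server actually sends
hint = charset_from_content_type("text/html; charset=utf-8")

print(hint)                                # utf-8
print(raw.decode(hint or "windows-1252"))  # få
print(raw.decode("windows-1252"))          # fÃ¥ (the garbling reported in this thread)
```

The last line reproduces the symptom from the original report: UTF-8 bytes decoded as windows-1252.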

Sebastian

On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> Any ideas?
> 
> Thanks!