Posted to dev@tika.apache.org by gbouchar <gb...@protonmail.com.INVALID> on 2018/07/26 09:38:14 UTC

improving Tika for web contents

Greetings everyone!

I have two pull requests related to the use of Tika for web content that have been waiting for review for quite some time now.

- [Improving html charset detection](https://github.com/apache/tika/pull/242) : None of the current charset detectors in Tika respect the web standards, and in my tests, I found that the charset of around 15% of web pages was misdetected by the default charset detector. This pull request implements a new charset detector for web pages, with better accuracy.
- [fixing mime-type detection over http](https://github.com/apache/tika/pull/236) : Currently, Tika has no knowledge of server-side interpreted languages such as PHP. Thus, given a URL like "http://example.com/index.php", it tends to guess that the mime type will be "text/x-php", whereas this is in fact very unlikely, since the server normally returns the output of the script rather than its source. This PR gives Tika the knowledge of which extensions are linked to server-side interpreted languages.
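
To illustrate the idea behind the second point, here is a minimal, language-agnostic sketch in Python. It is not the code from the PR and does not use Tika's API: the function name `guess_mime_from_url`, the extension set, and the small extension-to-MIME table are all hypothetical, chosen only to show how an extension hint could be ignored when a URL is fetched over HTTP and its extension belongs to a server-side interpreted language.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of server-side interpreted extensions (assumption:
# the actual list in the PR may differ).
SERVER_SIDE_EXTENSIONS = {".php", ".asp", ".aspx", ".jsp", ".cgi"}

# Hypothetical, deliberately tiny extension-to-MIME table for illustration.
EXTENSION_TO_MIME = {
    ".php": "text/x-php",
    ".html": "text/html",
    ".pdf": "application/pdf",
}


def guess_mime_from_url(url):
    """Return a MIME type hint based on the URL's extension, or None.

    Over http(s), an extension such as ".php" describes the script that
    produced the response, not the response itself, so no hint is given.
    """
    parsed = urlparse(url)
    ext = posixpath.splitext(parsed.path)[1].lower()
    if parsed.scheme in ("http", "https") and ext in SERVER_SIDE_EXTENSIONS:
        return None  # fall back to other evidence (headers, content sniffing)
    return EXTENSION_TO_MIME.get(ext)
```

With this sketch, `guess_mime_from_url("http://example.com/index.php")` gives no hint, while a local `file:///srv/www/index.php` is still reported as "text/x-php", because in that case the file really is PHP source.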

If someone could have a look at these pull requests, and maybe include them in the next release, that would help us a lot! I am of course still open to discussion and ready to update the code if changes need to be made.

Cheers,
G. Bouchar

Re: improving Tika for web contents

Posted by Tim Allison <ta...@apache.org>.
Y, we're waiting on dl4j so we have a week probably.
On Thu, Jul 26, 2018 at 8:06 AM gbouchar <gb...@protonmail.com> wrote:
>
> Thank you very much, Tim! Do you think it will make it into the next release?
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On 26 July 2018 at 1:58 PM, Tim Allison <ta...@apache.org> wrote:
>
> > Y. Sorry. At beach last week. Took care of quick issues yesterday, will try
> > to return to your PRs today. Thank you!
> >
>
>

Re: improving Tika for web contents

Posted by Tim Allison <ta...@apache.org>.
Y. Sorry. At beach last week. Took care of quick issues yesterday, will try
to return to your PRs today. Thank you!

On Thu, Jul 26, 2018 at 5:38 AM gbouchar <gb...@protonmail.com.invalid>
wrote:

> Greetings everyone!
>
> I have two pull requests related to the use of Tika for web content that
> have been waiting for review for quite some time now.
>
> - [Improving html charset detection](
> https://github.com/apache/tika/pull/242) : None of the current charset
> detectors in Tika respect the web standards, and in my tests, I found that
> the charset of around 15% of web pages was misdetected by the default
> charset detector. This pull request implements a new charset detector for
> web pages, with better accuracy.
> - [fixing mime-type detection over http](
> https://github.com/apache/tika/pull/236) : Currently, Tika has no
> knowledge of server-side interpreted languages such as PHP. Thus, given a
> URL like "http://example.com/index.php", it tends to guess that the mime
> type will be "text/x-php", whereas this is in fact very unlikely. This PR
> gives Tika the knowledge of which extensions are linked to server-side
> interpreted languages.
>
> If someone could have a look at these pull requests, and maybe include
> them in the next release, that would help us a lot! I am of course still
> open to discussion and ready to update the code if changes need to be
> made.
>
> Cheers,
> G. Bouchar