Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2015/07/16 20:58:31 UTC

Re: Tika Issue?

Hey Alex, what version of Tika Python are you using? And, moreover,
what version of Tika? I’m CC’ing folks on dev@tika.a.o, hope you
don’t mind.

I took the file you attached and saved it as blah.txt and ran
tika-python (with 1.9 tika) against it:

[mattmann-0420740:~] mattmann% tika-python detect type blah.txt
tika.py: Retrieving
http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar
to /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
[(200, u'text/plain')]
[mattmann-0420740:~] mattmann% tika-python language file blah.txt
[(200, u'en')]
[mattmann-0420740:~] mattmann%

Is that what you would expect? In general the language detection uses
N-grams and gets better when there is more text as a sample, but it can
get fooled sometimes too.
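
To make the sample-size point concrete, here is a toy character-trigram guesser. It is only a sketch of the general N-gram idea and not Tika's actual implementation: the two reference sentences, the cosine scoring, and all function names are assumptions made up for illustration (Python 3).

```python
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams in whitespace-normalized, lower-cased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (sum(c * c for c in a.values()) * sum(c * c for c in b.values())) ** 0.5
    return dot / norm if norm else 0.0


# Tiny made-up reference samples; a real detector is trained on large corpora.
REFERENCE = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs away from the fox",
    "no": "den raske brune reven hopper over den late hunden og så løper hunden bort fra reven",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in REFERENCE.items()}


def guess_language(text):
    """Return (best_language, scores_by_language) for the given text."""
    probe = trigram_profile(text)
    scores = {lang: similarity(probe, prof) for lang, prof in PROFILES.items()}
    return max(scores, key=scores.get), scores
```

With only a handful of trigrams in the probe, a few shared fragments (such as "over", which occurs in both reference samples) carry a lot of weight, which is one way a short or mixed-language sample can tip the result.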

Let me know what you think.

Cheers,
Chris

CC / memex-jpl@

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Alejandro Caceres <ac...@hyperiongray.com>
Date: Thursday, July 16, 2015 at 11:53 AM
To: jpluser <ch...@jpl.nasa.gov>
Cc: Amanda Towler <at...@hyperiongray.com>
Subject: Tika Issue?

>Hey Chris,
>
>
>I was about to submit this as a bug, but figured I'd run it by you first.
>Maybe you've encountered a similar issue.
>
>
>I'm doing some basic language categorization of websites. I saw that the
>Tika server/tika-python returns content as plain text, which is great to
>send to Tika language categorization (and just generally useful). However,
>it seemed to get very confused with sites that have footers in various
>languages, which is actually really common in the results we've found. For
>example, we have a totally English site and at the bottom are some links
>to the same site in other languages. This page, even though it's mostly
>English, gets categorized as a seemingly random language (like Lithuanian).
>
>
>As a workaround we tried running the web pages through a text
>summarization algo using lxml-readability, which gives us back a subset
>of the text on a page. My thinking was this would most likely strip
>footers and headers and give us back a decent representative sample of
>text on the page. The results seem to have improved a bit, but we're still
>getting some funky results where English pages are categorized as a
>seemingly random language; in many cases these pages seem pretty obviously
>(to the human eye) to be English.
>
>
>I wonder if someone at JPL (I don't see anyone from JPL here right now)
>could shed some light on why this might be happening. I've attached a
>couple of samples below. Also let me know if you'd like me to file any
>bugs anywhere to better track this; I just wanted to shoot this to you
>first to see if perhaps I was missing something obvious.
>
>
>
>Alex
>
>
>-- 
>___
>
>Alejandro Caceres
>Hyperion Gray, LLC
>Owner/CTO
>


Re: Tika Issue?

Posted by Alejandro Caceres <ac...@hyperiongray.com>.
Hey Chris,

Awesome! That answers all of my questions in one neat package. Thanks for
the help, I'll file the issues you mentioned.

This is all great stuff btw, tika-python is super clean and easy to use.

Alex

On Thu, Jul 16, 2015 at 4:04 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Alex,
>
> You nailed it. In the first 2 examples there is too small a sample of
> text for it to get the language categorization right. Feel free to file
> a Tika issue for this (http://issues.apache.org/jira/browse/TIKA). There
> was some talk about integrating the Google language detector into this
> since it’s ALv2; not sure if it will perform better with smaller samples
> or not.
>
> As for the latter example, this one is interesting. The main reason is
> that it’s not detecting the file as HTML, since it doesn’t have a file
> extension, and since the MIME magic for that page doesn’t match the
> traditional HTML magic (e.g., <html>.. blah). So, it’s parsing that with
> the TXTParser, which just extracts out the characters/text from the
> stream:
>
> >>> print string_parsed["metadata"]
> {u'Content-Encoding': [u'ISO-8859-1'], u'Content-Type': [u'text/plain;
> charset=ISO-8859-1'], u'X-TIKA:parse_time_millis': [u'58'],
> u'X-Parsed-By': [[u'org.apache.tika.parser.DefaultParser',
> u'org.apache.tika.parser.txt.TXTParser']]}
> >>>
>
> You can use the from_file method to parse URLs as well. Those URLs will
> be downloaded in Tika python to /tmp as files, and then parsed from
> there. If you use .from_file on the above, it will correctly just strip
> the text out, and then the language detector works. Try this:
>
> from tika import parser
> from tika import language
>
> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
> lang = language.from_buffer(parsed["content"])
> print lang
>
>
> Which should print out:
>
> >>> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
> tika.py: Retrieving http://ferretspatternu.ucoz.com/index.html to
> /tmp/index.html.
> >>> lang = language.from_buffer(parsed["content"])
> >>> print lang
> en
> >>>
>
>
> Note that I had to add /index.html at the end. Our code to create the
> tmp file in tika-python needs some prettying up, so if the URL doesn’t end
> with an actual file or extension it craps out. For now you can work around
> it that way. If you have time please file a GitHub issue for tika-python
> and we’ll make this better.
>
> HTH!
>
> Cheers,
> Chris
>

-- 
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO

Re: Tika Issue?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hey Alex,

You nailed it. In the first 2 examples there is too small a sample of
text for it to get the language categorization right. Feel free to file
a Tika issue for this (http://issues.apache.org/jira/browse/TIKA). There
was some talk about integrating the Google language detector into this
since it’s ALv2; not sure if it will perform better with smaller samples
or not.

As for the latter example, this one is interesting. The main reason is
that it’s not detecting the file as HTML, since it doesn’t have a file
extension, and since the MIME magic for that page doesn’t match the
traditional HTML magic (e.g., <html>.. blah). So, it’s parsing that with
the TXTParser, which just extracts out the characters/text from the
stream:

>>> print string_parsed["metadata"]
{u'Content-Encoding': [u'ISO-8859-1'], u'Content-Type': [u'text/plain;
charset=ISO-8859-1'], u'X-TIKA:parse_time_millis': [u'58'],
u'X-Parsed-By': [[u'org.apache.tika.parser.DefaultParser',
u'org.apache.tika.parser.txt.TXTParser']]}
>>> 
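
A quick way to spot this case in code is to look at the metadata before trusting the "content" field. The sketch below is an assumption-laden helper, not part of tika-python: the function names are hypothetical, and it only relies on the two metadata keys visible in the output above (Python 3).

```python
def flatten(value):
    """Yield the strings found inside arbitrarily nested lists or tuples."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)
    elif value is not None:
        yield str(value)


def parsed_as_plain_text(metadata):
    """Return True when the metadata suggests the input was handled as plain text."""
    content_types = list(flatten(metadata.get("Content-Type")))
    parsers = list(flatten(metadata.get("X-Parsed-By")))
    is_text = any(ct.lower().startswith("text/plain") for ct in content_types)
    used_txt_parser = any(p.endswith("TXTParser") for p in parsers)
    return is_text or used_txt_parser


# Metadata copied from the output shown above.
sample = {
    "Content-Encoding": ["ISO-8859-1"],
    "Content-Type": ["text/plain; charset=ISO-8859-1"],
    "X-Parsed-By": [["org.apache.tika.parser.DefaultParser",
                     "org.apache.tika.parser.txt.TXTParser"]],
}
```

If parsed_as_plain_text(sample) is true for something you know is a web page, the "content" is probably still raw markup and the language guess should not be trusted.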

You can use the from_file method to parse URLs as well. Those URLs will
be downloaded in Tika python to /tmp as files, and then parsed from
there. If you use .from_file on the above, it will correctly just strip
the text out, and then the language detector works. Try this:

from tika import parser
from tika import language

parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
lang = language.from_buffer(parsed["content"])
print lang


Which should print out:

>>> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
tika.py: Retrieving http://ferretspatternu.ucoz.com/index.html to
/tmp/index.html.
>>> lang = language.from_buffer(parsed["content"])
>>> print lang
en
>>> 


Note that I had to add /index.html at the end. Our code to create the
tmp file in tika-python needs some prettying up, so if the URL doesn’t end
with an actual file or extension it craps out. For now you can work around
it that way. If you have time please file a GitHub issue for tika-python
and we’ll make this better.
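
Until that is fixed, the workaround can be automated with something like the sketch below. The helper name ensure_filename and the "index.html" default are assumptions for illustration, and it presumes the server actually serves the page at that path, which is not guaranteed for every site (Python 3).

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def ensure_filename(url, default="index.html"):
    """Append `default` when the URL path does not end in a name with an extension."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last = posixpath.basename(path)
    if "." not in last:
        # No file-like last segment: treat the path as a directory.
        if not path.endswith("/"):
            path += "/"
        path += default
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

The result can then be passed to parser.from_file in place of the bare URL.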

HTH!

Cheers,
Chris


-----Original Message-----
From: Alejandro Caceres <ac...@hyperiongray.com>
Date: Thursday, July 16, 2015 at 12:56 PM
To: jpluser <ch...@jpl.nasa.gov>
Cc: Amanda Towler <at...@hyperiongray.com>, "dev@tika.apache.org"
<de...@tika.apache.org>, "memex-jpl@googlegroups.com"
<me...@googlegroups.com>
Subject: Re: Tika Issue?

>Gah I messed up the bug story. You're right, that text is categorized as
>en, I screwed up with the file. Here is a better/more accurate summary of
>what I'm seeing, with some examples. Pretend the previous email was all a
>terrible dream.
>
>
>There appear to be two potential issues going on; let's start with the
>language categorization because I already brought it up. I've attached 3
>files below for reference. language_no_1 and language_no_2 are both picked
>up as Norwegian; I suspect this is because there's a small amount of text.
>language_lt_2 is probably the most interesting to me: this text is picked
>up as Lithuanian, seems to have a good amount of text, but has a footer
>that is in various languages. I suspected that was throwing it off; however,
>most of the text is definitely English, so perhaps something else is going
>on.
>
>
>The other issue I'm seeing is with the parser, but maybe I've
>misunderstood something. Here is some code:
>
>
>import requests
>from tika import parser
>from tika import language
>
>
>r = requests.get("http://ferretspatternu.ucoz.com/")
>string_parsed = parser.from_buffer(r.text)
>lang = language.from_buffer(string_parsed["content"])
>print string_parsed["content"]
>print lang
>
>
>The language is picked up as Lithuanian; however, I see why. The "content"
>field looks like it is not plain text, but instead raw HTML. In other
>documents this field looks like it contains sanitized text... or am I
>missing something?
>
>
>
>Anyway, hope that's all a little bit clearer! Let me know what you think.
>
>
>
>Alex
>
>
>PS this is with the latest version of tika-python running tika server 1.9


Re: Tika Issue?

Posted by Alejandro Caceres <ac...@hyperiongray.com>.
Gah I messed up the bug story. You're right, that text is categorized as
en, I screwed up with the file. Here is a better/more accurate summary of
what I'm seeing, with some examples. Pretend the previous email was all a
terrible dream.

There appear to be two potential issues going on; let's start with the
language categorization because I already brought it up. I've attached 3
files below for reference. language_no_1 and language_no_2 are both picked
up as Norwegian; I suspect this is because there's a small amount of text.
language_lt_2 is probably the most interesting to me: this text is picked
up as Lithuanian, seems to have a good amount of text, but has a footer
that is in various languages. I suspected that was throwing it off; however,
most of the text is definitely English, so perhaps something else is going
on.

The other issue I'm seeing is with the parser, but maybe I've misunderstood
something. Here is some code:

import requests
from tika import parser
from tika import language

r = requests.get("http://ferretspatternu.ucoz.com/")
string_parsed = parser.from_buffer(r.text)
lang = language.from_buffer(string_parsed["content"])
print string_parsed["content"]
print lang

The language is picked up as Lithuanian; however, I see why. The "content"
field looks like it is not plain text, but instead raw HTML. In other
documents this field looks like it contains sanitized text... or am I
missing something?
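
For reference, this is roughly the kind of check that flags the problem. It is a rough heuristic sketch only; the function name, the regular expression, and the threshold of three tags are assumptions chosen for illustration (Python 3).

```python
import re

# Matches things that look like opening, closing, or self-closing tags.
TAG_RE = re.compile(r"<\s*/?\s*[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")


def looks_like_html(text, min_tags=3):
    """Heuristic: True when the text contains at least `min_tags` tag-like spans."""
    return sum(1 for _ in TAG_RE.finditer(text)) >= min_tags
```

Calling looks_like_html(string_parsed["content"]) before language detection would catch cases where markup slipped through unparsed.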

Anyway, hope that's all a little bit clearer! Let me know what you think.

Alex

PS this is with the latest version of tika-python running tika server 1.9





On Thu, Jul 16, 2015 at 2:58 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Alex, what version of Tika Python are you using? And, moreover,
> what version of Tika? I’m CC’ing folks on dev@tika.a.o, hope you
> don’t mind.
>
> I took the file you attached and saved it as blah.txt and ran
> tika-python (with 1.9 tika) against it:
>
> [mattmann-0420740:~] mattmann% tika-python detect type blah.txt
> tika.py: Retrieving
> http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar
> to /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
> [(200, u'text/plain')]
> [mattmann-0420740:~] mattmann% tika-python language file blah.txt
> [(200, u'en')]
> [mattmann-0420740:~] mattmann%
>
> Is that what you would expect? In general the language detection uses
> N-grams and gets better when there is more text as a sample, but it can
> get fooled sometimes too.
>
> Let me know what you think.
>
> Cheers,
> Chris
>
> CC / memex-jpl@
>


-- 
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO