Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/01/13 19:20:45 UTC
Getting language of parsed text
If I use the BodyContentHandler, it's easy to send the text I get back to a language detector:
ContentHandler handler = new BodyContentHandler(-1);
parser.parse(stream, handler, metadata, parseContext);
String str = handler.toString();
LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();
log.info("Language: " + detector.detectAll(str));
However, if I use ToXMLContentHandler, the detector obviously has trouble identifying the language because of all the XML markup. Is there an easy way to get just the body text of the XHTML output?
I played around with javax.xml.xpath et al., but I'm not sure that the document that comes back is valid XML.
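One possible way to pull only the body text out of an XHTML string, sketched with JDK classes alone. It assumes the ToXMLContentHandler output is well-formed, namespace-qualified XHTML, which is not confirmed in this thread. The class name XhtmlBodyText and the method bodyText are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika.
public class XhtmlBodyText {

    /** Returns the trimmed text content of the first body element, or "" if there is none. */
    public static String bodyText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XHTML elements live in a namespace
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // local-name() sidesteps the need to register the XHTML namespace prefix.
        Node body = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//*[local-name()='body']", doc, XPathConstants.NODE);
        return body == null ? "" : body.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the string returned by ToXMLContentHandler.toString().
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello world.</p>\n<p>Second paragraph.</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

The returned string could then be passed to detector.detectAll(...) as in the BodyContentHandler version. Note that getTextContent() concatenates text nodes as-is, so words from adjacent elements run together unless the XHTML has whitespace between them. If the parse throws a SAXException, the output was not well-formed and this approach does not apply.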
Re: Getting language of parsed text
Posted by Tim Allison <ta...@apache.org>.
Accidentally dropped user@...
Try TeeContentHandler?
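A minimal sketch of what the TeeContentHandler suggestion might look like: a single parse feeds both a ToXMLContentHandler (for the XHTML) and a BodyContentHandler (for the plain body text that goes to the detector). It assumes tika-core, the Tika parsers, and a language detection module are on the classpath. The class name TeeSketch and the method parseBoth are made up for illustration, and LanguageDetector.getDefaultLanguageDetector() stands in for the OptimaizeLangDetector used in the question.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical example class, not part of Tika.
public class TeeSketch {

    /** Parses the stream once and returns { xhtml, bodyText }. */
    static String[] parseBoth(InputStream stream) throws Exception {
        ToXMLContentHandler xml = new ToXMLContentHandler();
        BodyContentHandler body = new BodyContentHandler(-1); // -1 disables the write limit

        // The tee forwards every SAX event to both handlers.
        new AutoDetectParser().parse(
                stream, new TeeContentHandler(xml, body), new Metadata(), new ParseContext());

        return new String[] { xml.toString(), body.toString() };
    }

    public static void main(String[] args) throws Exception {
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            String[] out = parseBoth(stream);

            // Detect on the plain body text, so the XML markup never reaches the detector.
            LanguageDetector detector = LanguageDetector.getDefaultLanguageDetector().loadModels();
            System.out.println("Language: " + detector.detectAll(out[1]));
            System.out.println(out[0]);
        }
    }
}
```

A LanguageHandler could presumably take the place of the BodyContentHandler in the tee if only the detected language is needed and not the text itself.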
On Wed, Jan 13, 2021 at 3:05 PM Peter Kronenberg <pe...@torch.ai>
wrote:
> Cool, didn’t know about that. But that doesn’t seem to be able to return
> the text that it got. Can I do both?
RE: Getting language of parsed text
Posted by Peter Kronenberg <pe...@torch.ai>.
Cool, didn’t know about that. But that doesn’t seem to be able to return the text that it got. Can I do both?
From: Tim Allison <ta...@apache.org>
Sent: Wednesday, January 13, 2021 2:34 PM
To: user@tika.apache.org
Subject: Re: Getting language of parsed text
Try the LanguageHandler()?
Re: Getting language of parsed text
Posted by Tim Allison <ta...@apache.org>.
Try the LanguageHandler()?