Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/01/13 19:20:45 UTC

Getting language of parsed text

If I use the BodyContentHandler, it's easy to send the text I get back to a language detector:

ContentHandler handler = new BodyContentHandler(-1);
parser.parse(stream, handler, metadata, parseContext);

String str = handler.toString();

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();

log.info("Language: " + detector.detectAll(str));

However, if I use ToXMLContentHandler, the detector obviously has problems because of all the XML markup and metadata.  Is there an easy way to get just the body of the XHTML output?

I played around with javax.xml.xpath, et al., but I'm not sure that the document that comes back is a valid XML document.
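
For the XPath route, a minimal sketch using only JDK classes follows. It assumes the handler output is well-formed XHTML whose elements sit in the http://www.w3.org/1999/xhtml namespace; under that assumption a plain "//body" expression matches nothing with a namespace-aware parser, which can look like a broken document. Matching on local-name() sidesteps that. The class name XhtmlBodyText and the method bodyText are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XhtmlBodyText {

    /**
     * Returns the text content of the first body element, or an empty
     * string if there is none. Throws IllegalArgumentException if the
     * input cannot be parsed as XML (so it doubles as a well-formedness check).
     */
    public static String bodyText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // local-name() matches <body> whatever namespace it is in.
            return xpath.evaluate("//*[local-name()='body']", doc).trim();
        } catch (ParserConfigurationException | SAXException
                 | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample standing in for real handler output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><meta name=\"Content-Type\" content=\"text/plain\"/><title/></head>"
                + "<body><p>Bonjour tout le monde</p></body></html>";
        System.out.println(bodyText(sample)); // prints "Bonjour tout le monde"
    }
}
```

The returned string is the concatenation of all text nodes under body, so whether word boundaries between adjacent elements survive depends on the whitespace present in the markup.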

Re: Getting language of parsed text

Posted by Tim Allison <ta...@apache.org>.

Accidentally dropped user@...

Try TeeContentHandler?
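
A rough sketch of what the TeeContentHandler suggestion might look like: one parse feeds both a ToXMLContentHandler and a BodyContentHandler, so the XHTML and the plain body text are both available afterwards. This assumes the Tika 1.x class locations (org.apache.tika.sax and org.apache.tika.parser) and that the Tika parsers are on the classpath. TeeExample, Result, and parseBoth are hypothetical names for this illustration, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TeeExample {

    /** Hypothetical holder for the two views of a single parse. */
    public static final class Result {
        public final String xhtml;
        public final String text;

        Result(String xhtml, String text) {
            this.xhtml = xhtml;
            this.text = text;
        }
    }

    public static Result parseBoth(InputStream stream) {
        ContentHandler xml = new ToXMLContentHandler();
        ContentHandler body = new BodyContentHandler(-1); // -1 disables the write limit
        // Every SAX event from the parser is forwarded to both handlers.
        ContentHandler tee = new TeeContentHandler(xml, body);
        try {
            new AutoDetectParser().parse(stream, tee, new Metadata(), new ParseContext());
        } catch (IOException | SAXException | TikaException e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return new Result(xml.toString(), body.toString());
    }
}
```

The text field can then be passed to the language detector exactly as in the original post, while the xhtml field keeps the full markup.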

On Wed, Jan 13, 2021 at 3:05 PM Peter Kronenberg <pe...@torch.ai>
wrote:

> Cool, didn’t know about that.  But that doesn’t seem to be able to return
> the text that it got.  Can I do both?

RE: Getting language of parsed text

Posted by Peter Kronenberg <pe...@torch.ai>.

Cool, didn’t know about that.  But that doesn’t seem to be able to return the text that it got.  Can I do both?

From: Tim Allison <ta...@apache.org>
Sent: Wednesday, January 13, 2021 2:34 PM
To: user@tika.apache.org
Subject: Re: Getting language of parsed text

Try the LanguageHandler()?


Re: Getting language of parsed text

Posted by Tim Allison <ta...@apache.org>.

Try the LanguageHandler()?
