Posted to user@tika.apache.org by Martin Grotzke <ma...@javakaffee.de> on 2009/07/25 00:49:23 UTC
Getting no text content from html
Hi all,
I'm just starting with Tika and am trying to extract the text content of
some HTML. Unfortunately, I get no content at all.
This is my test method (in Scala):
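import java.io.ByteArrayInputStream
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.html.HtmlParser
import org.apache.tika.sax.BodyContentHandler

def testHtml(): Unit = {
  val html = "<html><body>my content</body></html>"
  // Encode explicitly instead of relying on the platform default charset
  val input = new ByteArrayInputStream(html.getBytes("UTF-8"))
  val metadata = new Metadata
  // Collects the text found inside the <body> element
  val textHandler = new BodyContentHandler
  val parser = new HtmlParser
  parser.parse(input, textHandler, metadata)
  input.close()
  println("HTML Input: " + html)
  println("Title: " + metadata.get("title"))
  println("Author: " + metadata.get("Author"))
  println("content: " + textHandler.toString)
}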
Is there anything wrong here?
Thanx && cheers,
Martin
Re: Getting no text content from html
Posted by Martin Grotzke <ma...@javakaffee.de>.
Hi,
On Sat, 2009-07-25 at 00:49 +0200, Martin Grotzke wrote:
> Hi all,
>
> I'm just starting with tika and try to extract the text content of some
> html. Unfortunately, I get no content at all.
>
> This is my test method (in scala):
>
> def testHtml() {
> val html = "<html><body>my content</body></html>"
> val input = new ByteArrayInputStream(html.getBytes)
> val metadata = new Metadata
> val textHandler = new BodyContentHandler
> val parser = new HtmlParser
> parser.parse(input, textHandler, metadata);
> input.close();
> println("HTML Input: " + html)
> println("Title: " + metadata.get("title"))
> println("Author: " + metadata.get("Author"))
> println("content: " + textHandler.toString)
> }
If the above was not explicit enough: textHandler.toString was empty.
Any help?
Thx && cheers,
Martin
Re: Getting no text content from html
Posted by Martin Grotzke <ma...@javakaffee.de>.
On Thu, 2009-07-30 at 00:22 +0200, Martin Grotzke wrote:
> Great, with 0.4 it works, now I get the html content!
>
> Just to mention it: I had to depend on
> org.apache.tika:tika-core
> and
> org.apache.tika:tika-parsers
> and was no longer able to just depend on org.apache.tika:tika to get
> everything that's needed.
Ok, probably it's enough just to depend on tika-parsers... :)
Cheers,
Martin
Re: Getting no text content from html
Posted by Martin Grotzke <ma...@javakaffee.de>.
Great, with 0.4 it works; now I get the HTML content!
Just to mention it: I had to depend on
org.apache.tika:tika-core
and
org.apache.tika:tika-parsers
and could no longer depend on just org.apache.tika:tika to get
everything that's needed.
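For reference, the dependency setup described above might look like this in an sbt build. This is only a sketch: the thread does not say which build tool is in use, so sbt is an assumption here. Only the artifact coordinates and the 0.4 version come from this thread. tika-parsers should pull in tika-core transitively, so a single entry is probably enough.

```scala
// build.sbt (sketch; sbt itself is an assumption, not something the thread states)
// tika-parsers is expected to bring in tika-core as a transitive dependency.
libraryDependencies += "org.apache.tika" % "tika-parsers" % "0.4"
```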
Thanx && cheers,
Martin
On Wed, 2009-07-29 at 10:58 +0200, Jukka Zitting wrote:
> Hi,
>
> On Sat, Jul 25, 2009 at 12:49 AM, Martin
> Grotzke<ma...@javakaffee.de> wrote:
> > I'm just starting with tika and try to extract the text content of some
> > html. Unfortunately, I get no content at all.
> >
> > This is my test method (in scala):
> >
> > def testHtml() {
> > val html = "<html><body>my content</body></html>"
>
> You're most likely hitting issue TIKA-210 [1] that's fixed in the 0.4 release.
>
> [1] https://issues.apache.org/jira/browse/TIKA-210
>
> BR,
>
> Jukka Zitting
>
Re: Getting no text content from html
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Sat, Jul 25, 2009 at 12:49 AM, Martin
Grotzke<ma...@javakaffee.de> wrote:
> I'm just starting with tika and try to extract the text content of some
> html. Unfortunately, I get no content at all.
>
> This is my test method (in scala):
>
> def testHtml() {
> val html = "<html><body>my content</body></html>"
You're most likely hitting issue TIKA-210 [1], which is fixed in the 0.4 release.
[1] https://issues.apache.org/jira/browse/TIKA-210
BR,
Jukka Zitting