Posted to user@tika.apache.org by Martin Grotzke <ma...@javakaffee.de> on 2009/07/25 00:49:23 UTC

Getting no text content from html

Hi all,

I'm just starting with Tika and am trying to extract the text content of
some HTML. Unfortunately, I get no content at all.

This is my test method (in scala):

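  import java.io.ByteArrayInputStream
  import org.apache.tika.metadata.Metadata
  import org.apache.tika.parser.html.HtmlParser
  import org.apache.tika.sax.BodyContentHandler

  def testHtml() {
    val html = "<html><body>my content</body></html>"
    // encode explicitly instead of relying on the platform default charset
    val input = new ByteArrayInputStream(html.getBytes("UTF-8"))
    val metadata = new Metadata
    // collects the character content found inside <body>
    val textHandler = new BodyContentHandler
    val parser = new HtmlParser
    parser.parse(input, textHandler, metadata)
    input.close()
    println("HTML Input: " + html)
    println("Title: " + metadata.get("title"))
    println("Author: " + metadata.get("Author"))
    println("content: " + textHandler.toString)
  }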

Is there anything wrong here?

Thanx && cheers,
Martin

Re: Getting no text content from html

Posted by Martin Grotzke <ma...@javakaffee.de>.

Hi,

On Sat, 2009-07-25 at 00:49 +0200, Martin Grotzke wrote:
> Hi all,
> 
> I'm just starting with tika and try to extract the text content of some
> html. Unfortunately, I get no content at all.
> 
> This is my test method (in scala):
> 
>   def testHtml() {
>     val html = "<html><body>my content</body></html>"
>     val input = new ByteArrayInputStream(html.getBytes)
>     val metadata = new Metadata
>     val textHandler = new BodyContentHandler
>     val parser = new HtmlParser
>     parser.parse(input, textHandler, metadata);
>     input.close();
>     println("HTML Input: " + html)
>     println("Title: " + metadata.get("title"))
>     println("Author: " + metadata.get("Author"))
>     println("content: " + textHandler.toString)
>   }

In case the above was not explicit enough: textHandler.toString returns an
empty string.

Any help?

Thx && cheers,
Martin


> 
> Is there anything wrong here?
> 
> Thanx && cheers,
> Martin
>

Re: Getting no text content from html

Posted by Martin Grotzke <ma...@javakaffee.de>.

On Thu, 2009-07-30 at 00:22 +0200, Martin Grotzke wrote:
> Great, with 0.4 it works, now I get the html content!
> 
> Just to mention it: I had to depend on
>   org.apache.tika:tika-core
> and
>   org.apache.tika:tika-parsers
> and was no longer able to just depend on org.apache.tika:tika to get
> everything that's needed.
Ok, probably it's enough just to depend on tika-parsers... :)

Cheers,
Martin


Re: Getting no text content from html

Posted by Martin Grotzke <ma...@javakaffee.de>.

Great, with 0.4 it works, now I get the html content!

Just to mention it: I had to depend on
  org.apache.tika:tika-core
and
  org.apache.tika:tika-parsers
and was no longer able to just depend on org.apache.tika:tika to get
everything that's needed.

Thanx && cheers,
Martin
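
The dependency change described above could look roughly like this in an sbt build. This is only a sketch: the thread does not say which build tool was used, so the sbt syntax is an assumption, and depending on tika-parsers alone relies on tika-core arriving as a transitive dependency, which matches what Martin suspects in his follow-up.

```scala
// build.sbt (sketch; sbt is an assumption, the thread does not name the build tool)
// Coordinates and version are the ones mentioned in the thread.
// tika-core is expected to be pulled in transitively by tika-parsers.
libraryDependencies += "org.apache.tika" % "tika-parsers" % "0.4"
```

If the transitive resolution does not work in a given setup, adding "org.apache.tika" % "tika-core" % "0.4" alongside it corresponds to the two-dependency setup from the message above.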



On Wed, 2009-07-29 at 10:58 +0200, Jukka Zitting wrote:
> Hi,
> 
> On Sat, Jul 25, 2009 at 12:49 AM, Martin
> Grotzke<ma...@javakaffee.de> wrote:
> > I'm just starting with tika and try to extract the text content of some
> > html. Unfortunately, I get no content at all.
> >
> > This is my test method (in scala):
> >
> >  def testHtml() {
> >    val html = "<html><body>my content</body></html>"
> 
> You're most likely hitting issue TIKA-210 [1] that's fixed in the 0.4 release.
> 
> [1] https://issues.apache.org/jira/browse/TIKA-210
> 
> BR,
> 
> Jukka Zitting
>

Re: Getting no text content from html

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Sat, Jul 25, 2009 at 12:49 AM, Martin
Grotzke<ma...@javakaffee.de> wrote:
> I'm just starting with tika and try to extract the text content of some
> html. Unfortunately, I get no content at all.
>
> This is my test method (in scala):
>
>  def testHtml() {
>    val html = "<html><body>my content</body></html>"

You're most likely hitting issue TIKA-210 [1] that's fixed in the 0.4 release.

[1] https://issues.apache.org/jira/browse/TIKA-210

BR,

Jukka Zitting
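
To check whether a given Tika version still shows the empty-content behaviour, the original test can be turned into a small self-checking program. The following is a minimal sketch, not the code from the thread: it assumes a Tika release later than the 0.4 discussed here, in which parse takes a fourth ParseContext argument, and the names HtmlTextCheck and extractText are hypothetical helpers introduced only for illustration.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.html.HtmlParser
import org.apache.tika.sax.BodyContentHandler

// Hypothetical helper object, not part of Tika or of the original post.
object HtmlTextCheck {

  // Parses the given HTML string and returns the trimmed body text.
  def extractText(html: String): String = {
    val input = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))
    try {
      val handler = new BodyContentHandler
      // Assumption: four-argument parse with a ParseContext (releases after 0.4).
      new HtmlParser().parse(input, handler, new Metadata, new ParseContext)
      handler.toString.trim
    } finally {
      input.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val text = extractText("<html><body>my content</body></html>")
    // An empty result would match the symptom reported in this thread.
    require(text.nonEmpty, "no text extracted from the HTML body")
    println("content: " + text)
  }
}
```

On the 0.4 release itself, the three-argument call from the original test is the one to use; only the parse line would differ.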