Posted to user@tika.apache.org by Fabian Lazarski <f....@scanplus.de> on 2009/08/26 10:59:59 UTC

No text content from pdf/rtf/odt/...

Hi,

I'm trying to extract the text content of some text document formats.
Unfortunately, I get no content at all. I use the tika-core-0.4.jar.

This is my test class:
public class ContentConverter
{
    private final static String filename = "/home/path/dir/foo.odt";
 
    public static void main(String[] args)
    { 
        InputStream input = null;
        ContentHandler textHandler = new BodyContentHandler();
        try {
            input = new FileInputStream(filename);
            Metadata metaData = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();
            // metaData.set(Metadata.RESOURCE_NAME_KEY, filename); -> NO success
 
            BufferedReader reader = new BufferedReader(new InputStreamReader(input));
            for (int i = 0; i < 10; i++) {
                System.out.println("Test input: " + reader.readLine()); // -> OK
            }
            parser.parse(input, textHandler, metaData);
 
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                input.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        System.out.println("Content: " + textHandler.toString()); // NO content
    }
}
Is something wrong here?

Regards
Fabian
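
A side note on the test class above, separate from the classpath question: the BufferedReader used for the "Test input" lines reads from the same InputStream that is later handed to the parser, so the parser no longer sees the stream from its first byte. The sketch below uses only java.io to illustrate this and to show one way to peek without consuming, via mark/reset on a BufferedInputStream. The class and method names are made up for illustration, and how Tika 0.4 reacts to a partly consumed stream is not something this thread establishes.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class StreamPeekDemo {

    // Bytes still unread in the raw stream after one readLine() through a reader.
    static int leftAfterReaderPeek(byte[] data) {
        try {
            InputStream input = new ByteArrayInputStream(data);
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(input, StandardCharsets.ISO_8859_1));
            reader.readLine(); // the reader buffers ahead, beyond the first line
            return input.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes still available after peeking at the head with mark/reset.
    static int leftAfterMarkReset(byte[] data, int peekBytes) {
        try {
            InputStream input = new BufferedInputStream(new ByteArrayInputStream(data));
            input.mark(peekBytes);           // remember the current position
            byte[] head = new byte[peekBytes];
            input.read(head, 0, peekBytes);  // look at the first bytes
            input.reset();                   // rewind to the marked position
            return input.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "line 1\nline 2\nline 3\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Total bytes:             " + data.length);
        System.out.println("Left after reader peek:  " + leftAfterReaderPeek(data));
        System.out.println("Left after mark/reset:   " + leftAfterMarkReset(data, 4));
    }
}
```

In the test class, the simplest variant is to drop the "Test input" loop, or to open a second FileInputStream for it.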


-----------------------------------------------------------------
Security - Availability - Continuity
-----------------------------------------------------------------
IT service management by

ScanPlus GmbH 
Lise-Meitner-Strasse 5          Tel. +49 731 92013 152 
D-89081 Ulm, Germany            Fax. +49 731 92013 290
Web: http://www.scan-plus.de/   Mail: lazarski@scan-plus.de

Ulm District Court (Amtsgericht Ulm), HRB 3220
Managing directors:               Jürgen Hörmann, Andreas Werther
-----------------------------------------------------------------

Reply: Re: No text content from pdf/rtf/odt/...

Posted by Fabian Lazarski <f....@scanplus.de>.

Hi.

Thanks! With all dependencies it works ;) 
I added this dependency to my pom.xml:
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>0.4</version>
        </dependency>
Regards
Fabian



From:
Jukka Zitting <ju...@gmail.com>
To:
tika-user@lucene.apache.org
Date:
2009/08/26 11:41
Subject:
Re: No text content from pdf/rtf/odt/...



Hi,

On Wed, Aug 26, 2009 at 10:59 AM, Fabian Lazarski<f....@scanplus.de> wrote:
> I'm trying to extract the text content of some text document formats.
> Unfortunately, I get no content at all. I use the tika-core-0.4.jar.

Make sure that you also have the tika-parsers jar and all the parser
library dependencies in your classpath.

See http://lucene.apache.org/tika/gettingstarted.html for details
about the classpath settings.

BR,

Jukka Zitting
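
A quick way to see whether the parser classes are visible at all is to ask the class loader for them before parsing. The sketch below is only an illustration: the helper is hypothetical, and the PDF parser class name is an assumption about the layout of the tika-parsers jar, so substitute whichever parser class you expect to find.

```java
// Hypothetical helper, not part of Tika.
class ClasspathCheck {

    // Returns true if the named class can be loaded, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.tika.parser.AutoDetectParser", // ships in tika-core
            "org.apache.tika.parser.pdf.PDFParser"     // assumed name of a tika-parsers class
        };
        for (String name : names) {
            System.out.println(name + " -> "
                    + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the core class is found but the parser class is missing, only tika-core is on the classpath, which matches the symptom described in this thread.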

Re: No text content from pdf/rtf/odt/...

Posted by Daan de Wit <d....@o3spaces.com>.

Hi,

Putting the name in the metadata should work; try whether removing the path helps.

Best,
Daan
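
As a rough sketch of that suggestion, the bare file name can be derived from the full path with java.nio.file before it goes into the metadata. The helper name is made up for illustration; the Tika call is shown only as a comment, using the key from the original test class.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika.
class ResourceNameDemo {

    // Strips the directory part, e.g. "/home/path/dir/foo.odt" becomes "foo.odt".
    static String resourceName(String path) {
        Path fileName = Paths.get(path).getFileName();
        return fileName == null ? "" : fileName.toString(); // null for a root path
    }

    public static void main(String[] args) {
        String filename = "/home/path/dir/foo.odt";
        System.out.println(resourceName(filename));
        // In the test class this would then become:
        // metaData.set(Metadata.RESOURCE_NAME_KEY, resourceName(filename));
    }
}
```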