Posted to user@tika.apache.org by Fabian Lazarski <f....@scanplus.de> on 2009/08/26 10:59:59 UTC
No text content from pdf/rtf/odt/...
Hi,
I'm trying to extract the text content of some text document formats.
Unfortunately, I get no content at all. I use the tika-core-0.4.jar.
This is my test class:
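import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ContentConverter
{
    private final static String filename = "/home/path/dir/foo.odt";

    public static void main(String[] args)
    {
        InputStream input = null;
        ContentHandler textHandler = new BodyContentHandler();
        try {
            input = new FileInputStream(filename);
            Metadata metaData = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();
            //metaData.set(Metadata.RESOURCE_NAME_KEY, filename); -> NO success

            // Debug output only. Note that this reads from the same stream,
            // so the parser below no longer sees the start of the file.
            BufferedReader reader = new BufferedReader(new InputStreamReader(input));
            for (int i = 0; i < 10; i++) {
                System.out.println("Test input: " + reader.readLine()); // -> OK
            }
            parser.parse(input, textHandler, metaData);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Guard against a failed open, which would leave input null.
                if (input != null) {
                    input.close();
                }
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        System.out.println("Content: " + textHandler.toString()); // NO content
    }
}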
Is something wrong here?
Regards
Fabian
-----------------------------------------------------------------
Sicherheit - Verfügbarkeit - Kontinuität
-----------------------------------------------------------------
IT-Service-Management von
ScanPlus GmbH
Lise-Meitner-Strasse 5 Tel. +49 731 92013 152
D-89081 Ulm, Germany Fax. +49 731 92013 290
Web: http://www.scan-plus.de/ Mail: lazarski@scan-plus.de
Amtsgericht Ulm HRB 3220
Geschäftsführung: Jürgen Hörmann, Andreas Werther
-----------------------------------------------------------------
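A side note on the test class above: the ten-line debug read pulls bytes from the same FileInputStream that is later handed to the parser, so the parser may start somewhere in the middle of the file. This is separate from the classpath question discussed in the replies. A minimal sketch of one way to look at the start of a stream and then rewind it, using only java.io. The class name StreamPeek and the helpers peek and readAll are made up for illustration and are not part of Tika.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class StreamPeek {

    /** Returns up to n leading bytes and rewinds the stream afterwards. */
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n); // remember the current position for up to n bytes
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break; // end of stream before n bytes were available
            }
            total += read;
        }
        in.reset(); // go back to the marked position
        return Arrays.copyOf(buf, total);
    }

    /** Reads the stream to its end and returns everything that was left. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int read;
        while ((read = in.read(buf)) >= 0) {
            out.write(buf, 0, read);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the test class this would be the FileInputStream.
        byte[] data = "PK-hypothetical-document-bytes".getBytes("US-ASCII");
        InputStream input = new BufferedInputStream(new ByteArrayInputStream(data));

        byte[] head = peek(input, 4);
        System.out.println("Peeked: " + new String(head, "US-ASCII"));
        // The stream is back at the start, so a later consumer sees all bytes.
        System.out.println("Bytes still available: " + readAll(input).length);
    }
}
```

Wrapping the FileInputStream in a BufferedInputStream is what makes mark/reset available here. Simply dropping the debug loop, or opening a second stream for it, would work as well.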
Reply: Re: No text content from pdf/rtf/odt/...
Posted by Fabian Lazarski <f....@scanplus.de>.
Hi.
Thanks! With all dependencies it works ;)
I added this dependency to my pom.xml:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>0.4</version>
</dependency>
Regards
Fabian
From: Jukka Zitting <ju...@gmail.com>
To: tika-user@lucene.apache.org
Date: 26.08.2009 11:41
Subject: Re: No text content from pdf/rtf/odt/...
Hi,
On Wed, Aug 26, 2009 at 10:59 AM, Fabian Lazarski<f....@scanplus.de> wrote:
> I'm trying to extract the text content of some text document formats.
> Unfortunately, I get no content at all. I use the tika-core-0.4.jar.
Make sure that you also have the tika-parsers jar and all the parser
library dependencies in your classpath.
See http://lucene.apache.org/tika/gettingstarted.html for details
about the classpath settings.
BR,
Jukka Zitting
Re: No text content from pdf/rtf/odt/...
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Wed, Aug 26, 2009 at 10:59 AM, Fabian Lazarski<f....@scanplus.de> wrote:
> I'm trying to extract the text content of some text document formats.
> Unfortunately, I get no content at all. I use the tika-core-0.4.jar.
Make sure that you also have the tika-parsers jar and all the parser
library dependencies in your classpath.
See http://lucene.apache.org/tika/gettingstarted.html for details
about the classpath settings.
BR,
Jukka Zitting
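To illustrate the classpath advice above, here is a rough sketch of a check that reports whether given classes can be loaded at all. It uses only the JDK. The class ClasspathCheck and its helper isOnClasspath are hypothetical names, and the Tika class names in main are assumptions about where the parser classes live in the 0.4 jars, so adjust them to the version in use.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate and load the class, run no static code
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.tika.parser.AutoDetectParser", // expected in tika-core
            "org.apache.tika.parser.pdf.PDFParser"     // assumed to be in tika-parsers
        };
        for (String name : names) {
            System.out.println(name + ": " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

A "found" result only shows that the class itself is present. The third-party libraries a parser relies on could still be missing, so this is a first sanity check rather than proof of a complete classpath.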
Re: No text content from pdf/rtf/odt/...
Posted by Daan de Wit <d....@o3spaces.com>.
Hi,
Putting the name in the metadata should work. Try whether removing the path from the name helps.
Best,
Daan
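A small sketch of the suggestion above, stripping the directory part so that only the bare file name would be passed as the resource name. Only java.io.File is used. The class ResourceName and the helper baseName are made-up names, and the Tika call is shown as a comment because it needs the Tika jars.

```java
import java.io.File;

public class ResourceName {

    /** Returns the last path element, i.e. the file name without directories. */
    public static String baseName(String path) {
        return new File(path).getName();
    }

    public static void main(String[] args) {
        String name = baseName("/home/path/dir/foo.odt");
        System.out.println("Resource name: " + name);
        // With Tika on the classpath, the name would then be set as in the thread:
        // metaData.set(Metadata.RESOURCE_NAME_KEY, name);
    }
}
```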