Posted to dev@tika.apache.org by "Djarmati, Sandor" <Sa...@roesberg.com> on 2010/08/17 15:00:33 UTC
Tika
Hi,
I'm using Tika 0.7 from C# (.NET) to extract text from PDF files.
It works fine in general, but it has problems with some files, for example the PDF file in the attachment.
This PDF file contains some text written vertically (without any line breaks or similar).
When the text is extracted, Tika doesn't return the whole words;
instead it takes single letters or short fragments and emits each one as a separate 'word' (as you can see below).
Output from Tika:
################################################
Hallo das ist die ÜBERSCHRIFTHallo das ist die
ÜBERSCHRIFT!!
Ha
llo
da
s is
t ei
n v
ert
ika
les
TE
XT
FE
LD
Hallo das ist ein anderes vertikales TEXTFELD
Hallo das ist ein horizontales TEXTFELD
H
a
ll
o
H
al
lo H
a
l
l
o
...
################################################
If anyone knows how to avoid it, please let me know.
My source code follows the example shown on this page:
http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm.aspx
With best regards
Sandor Djarmati
Information Engineering
University of Cooperative
Education Karlsruhe
Student
Phone: +49 721 95018-0
Fax: +49 721 503266
sandor.djarmati@roesberg.com
www.roesberg.com <http://www.roesberg.com/>
Roesberg Engineering - Ingenieurgesellschaft mbH für Automation
Industriestr.9, 76189 Karlsruhe, Germany
Registered office: 76189 Karlsruhe
Managing directors: Ute Heimann, Ralph Roesberg
Commercial register: Registergericht Mannheim, HRB 104689