
Posted to dev@tika.apache.org by "Djarmati, Sandor" <Sa...@roesberg.com> on 2010/08/17 15:00:33 UTC

Tika

Hi,
 
I'm using Tika 0.7 from C# (.NET) to extract text from PDF files.
It works fine in general, but it has problems with some files, for example the PDF file in the attachment.
This PDF file contains some text written vertically (without any line breaks or anything similar).
When the text is extracted, Tika doesn't return the whole words;
instead it takes single letters and outputs them as separate 'words' (as you can see below).
 
Output from Tika:
 
################################################
 
Hallo das ist die ÜBERSCHRIFTHallo das ist die 
ÜBERSCHRIFT!! 
Ha
llo
 da
s is
t ei
n v
ert
ika
les
 TE
XT
FE
LD
 
 
Hallo das ist ein anderes vertikales TEXTFELD 
Hallo das ist ein horizontales TEXTFELD 
H
a
ll
o 
H
al
lo H
a
l
l
o 

...
################################################
 
If anyone knows how to avoid this, please let me know.
My source code follows the example shown on this page:
http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm.aspx
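
As a stopgap, the fragments could be rejoined after extraction. The following is only a minimal sketch in Python, not part of Tika or of the linked example: the function name merge_short_lines and the max_len threshold are hypothetical, and it assumes that the fragments are never longer than four characters and keep their original leading and trailing spaces, as in the output above. Genuinely short lines in a document would be merged as well, so it is a heuristic, not a fix for the extraction itself.

```python
def merge_short_lines(text, max_len=4):
    """Join runs of consecutive very short lines into a single line.

    Heuristic post-processing only: any non-blank line of at most
    max_len characters is treated as a fragment of vertical text.
    """
    merged = []      # finished output lines
    fragments = []   # current run of short lines

    def flush():
        # Emit the collected fragments as one line, keeping their spaces.
        if fragments:
            merged.append("".join(fragments))
            fragments.clear()

    for line in text.split("\n"):
        if line.strip() and len(line) <= max_len:
            fragments.append(line)
        else:
            # A long or blank line ends the current run of fragments.
            flush()
            merged.append(line)
    flush()
    return "\n".join(merged)


# Fragments copied from the Tika output above.
sample = "\n".join(["Ha", "llo", " da", "s is", "t ei", "n v", "ert",
                    "ika", "les", " TE", "XT", "FE", "LD"])
print(merge_short_lines(sample))  # Hallo das ist ein vertikales TEXTFELD
```

The threshold of four matches the longest fragment in the sample ("s is", "lo H"); other documents may need a different value.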
 
 



With best regards 

Sandor Djarmati 

