Posted to user@tika.apache.org by Tim Allison <ta...@apache.org> on 2020/12/26 22:18:10 UTC

Re: Apache Tika issue

On Sat, Dec 26, 2020 at 12:54 PM sofien benharchache <
sofien.benharchache@gmail.com> wrote:

> Hello,
>
> I am using Apache Tika with Python to extract text from PDF files, and I
> have a problem with the extracted content: the order of the text is
> sometimes wrong.
>
> Some of my PDF files contain free-form text in which some lines are laid
> out in two columns. One column holds a year and the other holds a
> description associated with that year.
>
> For instance:
> dateA   descriptionA
> dateB   descriptionB
>
> For example, here is an extract of one file:
>
> I can’t provide the whole file, as the data is not meant to be shared.
>
> I expect Apache Tika to extract the content in this order:
> dateA descriptionA dateB descriptionB
>
> But the output is the following:
> dateA dateB descriptionA descriptionB
>
> I included this property in my configuration file:
> <property name="sortByPosition" value="true"/>
>
> Then I ran this code:
> from tika import parser
> parsed = parser.from_file('/path/to/file',
>                           config_path='/my/path/tika.config')
>
> But it doesn’t change the output.
>
> Do you have any idea how to resolve this issue?
>
> Thanks,
>
>

Re: Apache Tika issue

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

If it works with PDFBox but not with Tika, the difference is probably
related to a change in PDFBox, most likely this one:
https://issues.apache.org/jira/browse/PDFBOX-5002

You could try a Tika 1.26 snapshot:

https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.26-SNAPSHOT/

Tilman


On 27.12.2020 at 14:36, sofien benharchache wrote:
> Hi,
>
> Thanks for the help and for answering so quickly! Much appreciated.
> It works now with PDFBox. Changing the config file was not
> sufficient.
> Still, I would like to use Apache Tika for the parsing because I
> sometimes deal with other formats. Do you have any further ideas on
> how I could obtain similar results with Apache Tika?
> The lines are indeed very close to each other.
>
> Thanks!
>
>> On 27 Dec 2020 at 05:35, Tilman Hausherr <THausherr@t-online.de> wrote:
>>
>> Check whether the flag has any effect on other PDFs. If not, then the
>> option is not being set correctly.
>>
>> Here's a config.xml; the option is specified differently from how you
>> did it:
>>
>> <properties>
>>   <parsers>
>>     <parser class="org.apache.tika.parser.DefaultParser">
>>       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
>>     </parser>
>>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>>       <params>
>>         <param name="enableAutoSpace" type="bool">true</param>
>>         <param name="sortByPosition" type="bool">true</param>
>>       </params>
>>     </parser>
>>   </parsers>
>> </properties>
>>
>> Second thing: try PDFBox directly. Download pdfbox-app from
>> https://pdfbox.apache.org/download.html
>> (direct link: https://www.apache.org/dyn/closer.lua?filename=pdfbox/2.0.22/pdfbox-app-2.0.22.jar&action=download)
>>
>> and then run
>>
>> java -jar pdfbox-app-2.0.22.jar ExtractText -sort XXXX.pdf
>>
>> Third possibility: the lines are very close to each other. Is your
>> PDF like that?
>>
>> Tilman


Re: Apache Tika issue

Posted by Tilman Hausherr <TH...@t-online.de>.
Check whether the flag has any effect on other PDFs. If not, then the
option is not being set correctly.

Here's a config.xml; the option is specified differently from how you did it:

<properties>
   <parsers>
     <parser class="org.apache.tika.parser.DefaultParser">
       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
     </parser>
     <parser class="org.apache.tika.parser.pdf.PDFParser">
       <params>
         <param name="enableAutoSpace" type="bool">true</param>
         <param name="sortByPosition" type="bool">true</param>
       </params>
     </parser>
   </parsers>
</properties>
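
As a minimal sketch of how a config like the one above could be used from Python: the snippet below writes the XML to a file, checks which PDFParser parameters it sets, and passes the file to tika-python through config_path (the same argument used in the original question). The file name tika-config.xml and the helper names are made up for illustration. One assumption worth verifying: as far as I know, tika-python only hands the config to the Tika server when it starts the server itself, so a server that is already running may need to be stopped before the new config takes effect.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# The config from this message, verbatim.
CONFIG_XML = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"


def pdf_parser_params(config_xml):
    """Return the <param> settings configured for PDFParser (hypothetical helper)."""
    root = ET.fromstring(config_xml)  # raises ParseError if the XML is malformed
    params = {}
    for parser_el in root.iter("parser"):
        if parser_el.get("class") == PDF_PARSER:
            for param in parser_el.iter("param"):
                params[param.get("name")] = param.text
    return params


def write_config(path="tika-config.xml"):
    """Write the config to disk and return its path (file name is an assumption)."""
    Path(path).write_text(CONFIG_XML, encoding="utf-8")
    return str(path)


def extract_text(pdf_path, config_path):
    """Parse a PDF through tika-python; needs the tika package and Java installed."""
    from tika import parser  # imported lazily so the helpers above work without it
    parsed = parser.from_file(pdf_path, config_path=config_path)
    return parsed.get("content")


print(pdf_parser_params(CONFIG_XML))
```

A call such as extract_text('/path/to/file', write_config()) would then run the extraction with the option set in the PDFParser block rather than as a top-level property.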

Second thing: try PDFBox directly. Download pdfbox-app from
https://pdfbox.apache.org/download.html
(direct link: https://www.apache.org/dyn/closer.lua?filename=pdfbox/2.0.22/pdfbox-app-2.0.22.jar&action=download)

and then run

java -jar pdfbox-app-2.0.22.jar ExtractText -sort XXXX.pdf
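
Since the original question uses Python, here is a rough sketch of running the same PDFBox command from a script and capturing the text. The function names are hypothetical. It adds the -console option of ExtractText so that the text goes to standard output instead of a .txt file next to the PDF, and it assumes java is on the PATH and the jar has been downloaded to the working directory.

```python
import subprocess


def extract_text_command(jar_path, pdf_path, sort=True):
    """Build the PDFBox ExtractText command line as an argument list."""
    cmd = ["java", "-jar", jar_path, "ExtractText"]
    if sort:
        cmd.append("-sort")  # same flag as in the command above
    cmd += ["-console", pdf_path]  # print the text instead of writing a file
    return cmd


def extract_text(jar_path, pdf_path, sort=True):
    """Run PDFBox and return the extracted text; raises if java exits with an error."""
    result = subprocess.run(
        extract_text_command(jar_path, pdf_path, sort),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(" ".join(extract_text_command("pdfbox-app-2.0.22.jar", "XXXX.pdf")))
```

Comparing extract_text(..., sort=True) with extract_text(..., sort=False) on the same file is a quick way to see whether sorting changes the order at all for a given PDF.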


Third possibility: the lines are very close to each other. Is your PDF
like that?
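
To illustrate why closely spaced lines might matter, here is a toy model of position-based ordering. It is not PDFBox's actual algorithm, only a sketch under simple assumptions: each text fragment has an x and y coordinate (y growing downward), fragments whose y values fall within a tolerance of the first fragment of a line are treated as that same line, and each line is then read left to right. The coordinates and the function name are invented for the example.

```python
def reading_order(fragments, line_tolerance):
    """Order (text, x, y) fragments top to bottom, then left to right.

    Fragments whose y is within line_tolerance of the first fragment
    of the current line are grouped into that line.
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: f[2]):  # top to bottom
        if lines and abs(frag[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda f: f[1]))  # left to right
    return [text for text, _x, _y in ordered]


# Two rows only 4 units apart vertically (made-up coordinates).
fragments = [
    ("dateA", 50, 100), ("descriptionA", 120, 100),
    ("dateB", 50, 104), ("descriptionB", 120, 104),
]

print(reading_order(fragments, line_tolerance=1))
# ['dateA', 'descriptionA', 'dateB', 'descriptionB']
print(reading_order(fragments, line_tolerance=5))
# ['dateA', 'dateB', 'descriptionA', 'descriptionB']
```

With a tolerance larger than the line spacing, the two rows collapse into one "line" and the left column is read before the right one, which matches the ordering reported in the original question.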

Tilman

