Posted to users@pdfbox.apache.org by Francisco Andrés Fernández <fr...@gmail.com> on 2016/02/24 20:17:37 UTC

Bad text extraction result

Hi all,
I'm extracting text from PDF files through Tika in Solr. As a result, some
important words end up with spaces between their characters.
For example, the word "Subtitle", which I want to detect, might come out
as "S u b t i t l e".
How can I make PDFBox detect this kind of word occurrence?
Many thanks,

Francisco

Re: Bad text extraction result

Posted by Tilman Hausherr <TH...@t-online.de>.

On 01.03.2016 at 21:53, Francisco Andrés Fernández wrote:
> I'm sorry. That was only the case when you use pdftotext to extract text.
> My apologies.

No problem... now I understand what this /ActualText thing is about. This

     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC

means that the space is to be replaced with \376\377\000\255. These are
octal escapes: \376\377 is 0xFEFF, the byte order mark that flags the
string as UTF-16BE, and \000\255 is 0x00AD, i.e. U+00AD, the soft hyphen.
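
To make the decoding concrete, here is a minimal sketch in plain Java (no PDFBox involved). The class and method names are hypothetical, it only handles \ddd octal escapes, and it falls back to Latin-1 where the PDF specification would use PDFDocEncoding, so treat it as an illustration rather than a full parser:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class ActualTextDecoder {

    static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    // Converts the \ddd octal escapes of a PDF literal string into raw bytes.
    // Other escape sequences (\n, \(, ...) are not handled in this sketch.
    static byte[] unescapeOctal(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length() && isOctal(literal.charAt(i + 1))) {
                i++; // skip the backslash
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < literal.length() && isOctal(literal.charAt(i))) {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                out.write(c & 0xFF);
                i++;
            }
        }
        return out.toByteArray();
    }

    // A leading FE FF marks the string as UTF-16BE; otherwise this sketch
    // assumes Latin-1 (a simplification of PDFDocEncoding).
    static String decodeTextString(byte[] bytes) {
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decoded = decodeTextString(unescapeOctal("\\376\\377\\000\\255"));
        // Prints the code point of the single decoded character.
        System.out.printf("U+%04X%n", (int) decoded.charAt(0)); // prints U+00AD
    }
}
```

The four escaped bytes decode to a single character, the soft hyphen, which is what the /ActualText entry asks a reader to substitute for the space.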

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Bad text extraction result

Posted by Francisco Andrés Fernández <fr...@gmail.com>.

I'm sorry. That was only the case when you use pdftotext to extract text.
My apologies.

Francisco

Re: Bad text extraction result

Posted by Francisco Andrés Fernández <fr...@gmail.com>.

Hi Tilman, regarding this issue, I've found a workaround that does not
solve the PDFBox problem but might help.
I've filtered my documents by replacing matches of the regex '[\xAD]'
(hex for 'soft hyphen'), as that seems to be the symbol that gets
inserted between normal characters.
After that, the text appears as required.
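
A minimal sketch of that filter in plain Java, assuming the match is replaced with an empty string (the message above does not say what the replacement was). The class name and the sample input are made up for illustration, and whether soft hyphens show up at all depends on the extraction tool:

```java
import java.util.regex.Pattern;

// Hypothetical post-processing step applied to already extracted text.
class SoftHyphenFilter {

    // \xAD is the regex hex escape for U+00AD SOFT HYPHEN.
    private static final Pattern SOFT_HYPHEN = Pattern.compile("[\\xAD]");

    // Removes every soft hyphen and leaves all other characters untouched.
    static String strip(String text) {
        return SOFT_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up input with soft hyphens between syllables.
        String extracted = "Ca\u00ADda fras\u00ADco ampo\u00ADlla";
        System.out.println(strip(extracted)); // prints Cada frasco ampolla
    }
}
```

Note that this removes soft hyphens everywhere, including places where they were legitimate hyphenation hints, which is usually harmless for indexing.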
Regards,

Francisco

Re: Bad text extraction result

Posted by Francisco Andrés Fernández <fr...@gmail.com>.

Thanks Tilman.

Re: Bad text extraction result

Posted by Tilman Hausherr <TH...@t-online.de>.

Thanks. The issue is here:
https://issues.apache.org/jira/browse/PDFBOX-3248

Tilman

Re: Bad text extraction result

Posted by Francisco Andrés Fernández <fr...@gmail.com>.

As additional information, I've found two related posts (about other tools)
on Stack Overflow:
http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
Regards

Re: Bad text extraction result

Posted by Francisco Andrés Fernández <fr...@gmail.com>.

Many thanks Tilman.
I'll try to find a workaround in the meantime.
Cheers,

Francisco

Re: Bad text extraction result

Posted by Tilman Hausherr <TH...@t-online.de>.

I'll create an issue in JIRA later or tomorrow, but don't expect that 
this will be fixed quickly (unless I missed something obvious). We want 
to release 2.0 before doing any "big" work on text extraction.

Tilman

Re: Bad text extraction result

Posted by Tilman Hausherr <TH...@t-online.de>.

I tried all the settings without success. I was unable to extract
"Cada frasco ampolla", which looked pretty obvious; it always came out as
"Ca da fras co ampo lla".

Then I looked into the content stream and found this:

     6 0 1.058 6 122.0924 312.51 Tm
     (Ca) Tj
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (da ) -301 (fras) ] TJ
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (co ) -301 (ampo) ] TJ
     /Span << /ActualText (\376\377\000\255) >> BDC
       ( ) Tj
     EMC
     [ (lla ) -301 (con) ] TJ

So there really are spaces there, and we keep them. Adobe is smarter
and ignores them because they are overwritten thanks to the "-301" you
see (a positioning adjustment in the TJ array).

This /ActualText entry might be of some help, but I don't think we
process it.

Tilman


Re: Bad text extraction result

Posted by Francisco Andrés Fernández <fr...@gmail.com>.

Hi Tilman, many thanks for your answer.
I couldn't find any configuration file to tweak this.
Here is the link to the PDF file, in case you can work out what the
problem is.
https://drive.google.com/file/d/0B0PMZsHkpcJRSEpBSWhtQndKZTg/view?usp=sharing
Many thanks in advance,

Francisco

Re: Bad text extraction result

Posted by Tilman Hausherr <TH...@t-online.de>.

On 24.02.2016 at 20:17, Francisco Andrés Fernández wrote:
> Hi all,
> I'm extracting text from PDF files through Tika in Solr. As a result, some
> important words end up with spaces between their characters.
> For example, the word "Subtitle", which I want to detect, might come out
> as "S u b t i t l e".

You could try to modify spacingTolerance or averageCharTolerance in
PDFTextStripper (find out if Tika supports this), but it is likely that
if spaces are ignored, they will also be ignored in other places where
you don't want that.

If possible, please upload your file somewhere.

Tilman

> How can I make PDFBox detect this kind of word occurrence?
> Many thanks,
>
> Francisco
>

