
Posted to users@jackrabbit.apache.org by Péterfi Balázs <b....@i-deal.hu> on 2009/01/26 17:25:53 UTC

searching in OCRed pdf

Hello,

I'm developing an application that uses Jackrabbit and have a problem 
with searching in PDF files. When I search in a PDF that was generated 
from a Word document, it works. When I search in a PDF that contains a 
scanned document, I can search through its contents from within Adobe 
Reader (some sort of Optical Character Recognition), but my application 
does not return any results. I don't know how this kind of PDF works, 
but I need to search in it. Does Jackrabbit support it?

Thank you!
Balazs

Re: searching in OCRed pdf

Posted by Paco Avila <mo...@gmail.com>.

Jackrabbit's PDF text extractor uses PDFBox. If Adobe Reader can search
the text, then PDFBox should be capable of extracting it, but this is
only my opinion.

On Mon, Jan 26, 2009 at 5:47 PM, Péterfi Balázs <b....@i-deal.hu> wrote:
> I think it has already been OCRed because, as I wrote, I can search in the
> PDF with Adobe Reader, and it also selects the result. But what I see is a
> scanned page, and I guess there is a text layer "behind" it. Is that possible?

-- 
Paco Avila
GIT Consultors
tel: +34 971 498310
fax: +34 971496189
e-mail: pavila@git.es
http://www.git.es

Re: searching in OCRed pdf

Posted by Péterfi Balázs <b....@i-deal.hu>.

I think it has already been OCRed because, as I wrote, I can search in the 
PDF with Adobe Reader, and it also selects the result. But what I see is a 
scanned page, and I guess there is a text layer "behind" it. Is that possible?

Paco Avila wrote:
> You can write a text extractor that performs OCR.

Re: searching in OCRed pdf

Posted by Paco Avila <mo...@gmail.com>.

You can write a text extractor that performs OCR.
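
A minimal sketch of that idea, assuming the OCR step is only needed when the
normal extraction returns no text. Everything named here
(FallbackExtractorSketch, extractWithFallback, the two Function parameters)
is hypothetical and is not Jackrabbit or PDFBox API; a real implementation
would have to be adapted to Jackrabbit's own text extractor interface, and
the OCR function would wrap whatever OCR tool is actually available.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical sketch: use the normal extractor first, OCR only as a fallback. */
final class FallbackExtractorSketch {

    /**
     * Returns the text from the primary extractor, or the OCR result when
     * the primary extractor yields null or only whitespace.
     * The content is passed as a byte array so both extractors see the
     * complete document (an InputStream could only be read once).
     */
    static String extractWithFallback(byte[] content,
                                      Function<byte[], String> primary,
                                      Function<byte[], String> ocr) {
        String text = primary.apply(content);
        if (text != null && !text.trim().isEmpty()) {
            return text; // the document already carries a usable text layer
        }
        return ocr.apply(content); // no text found, so try OCR instead
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real caller would pass the PDF's binary content.
        byte[] fakePdf = "not a real PDF".getBytes(StandardCharsets.UTF_8);

        // Assumed stand-ins for the real extractors.
        Function<byte[], String> noTextLayer = bytes -> "";
        Function<byte[], String> fakeOcr = bytes -> "text recognised by OCR";

        // Prints "text recognised by OCR" because the primary result is empty.
        System.out.println(extractWithFallback(fakePdf, noTextLayer, fakeOcr));
    }
}
```

Running OCR only as a fallback keeps indexing fast for PDFs that already
contain text, such as the ones generated from Word documents.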



