Posted to solr-user@lucene.apache.org by Dirk Högemann <di...@googlemail.com> on 2012/02/10 11:21:32 UTC

Solr / Tika Integration

Hello,

We use Solr 3.5 and Tika to index a large number of PDFs. The content of those
PDFs is searchable via full-text search.
The indexed terms are also used to generate search suggestions.

Unfortunately, PDFBox seems to insert a space character wherever there is a
soft hyphen in the content of a PDF.
As a result, the extracted text is sometimes badly fragmented. For example,
the word Medizin is extracted as Me di zin.
As a consequence, the suggestions are often unusable and the search does not
work as expected.

Does anyone have a suggestion for extracting the content of PDFs containing
soft hyphens without fragmenting it?

Best
Dirk

Re: Solr / Tika Integration

Posted by Shairon Toledo <sh...@gmail.com>.

Hi,
Maybe the PDF creation tool is not generating "fluid" text. In a PDF, text is
stored in sections defined by objects, e.g. for "Medizin":

20 0 obj
(Medizin)
endobj

However, this can also happen:

20 0 obj
(Me)
endobj

21 0 obj
(di)
endobj

22 0 obj
(zin)
endobj

Note that there are now three text objects, which the extraction tool can
interpret as three words.
Check your PDF file to make sure that it is well-formed.
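
To illustrate why split text runs can come out as separate words, here is a
minimal Java sketch of a gap-based joining heuristic. It is not PDFBox's
actual algorithm: the Fragment class, the join method, and the maxGap
threshold are all hypothetical names invented for this illustration. The idea
is only that an extractor sees positioned runs of text and has to guess
whether the distance between two runs means a word break.

```java
import java.util.Arrays;
import java.util.List;

final class FragmentJoiner {

    /** Hypothetical model of one text run: its text, start position, and width. */
    public static final class Fragment {
        final String text;
        final double x;
        final double width;

        public Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates the runs in order, inserting a space only when the
     * horizontal gap to the previous run is larger than maxGap.
     */
    public static String join(List<Fragment> fragments, double maxGap) {
        StringBuilder out = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : fragments) {
            if (prev != null && f.x - (prev.x + prev.width) > maxGap) {
                out.append(' ');
            }
            out.append(f.text);
            prev = f;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three runs that touch each other: treated as one word.
        List<Fragment> tight = Arrays.asList(
                new Fragment("Me", 0.0, 12.0),
                new Fragment("di", 12.0, 8.0),
                new Fragment("zin", 20.0, 14.0));
        // The same runs with a gap of 2 units between them: treated as three words.
        List<Fragment> loose = Arrays.asList(
                new Fragment("Me", 0.0, 12.0),
                new Fragment("di", 14.0, 8.0),
                new Fragment("zin", 24.0, 14.0));
        System.out.println(join(tight, 1.0)); // Medizin
        System.out.println(join(loose, 1.0)); // Me di zin
    }
}
```

In this toy model, the fragmentation depends entirely on the positions
recorded in the file and on the threshold the extractor uses, which is why
the same PDF can extract cleanly with one tool and badly with another.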






-- 
[ ]'s
Shairon Toledo
http://www.google.com/profiles/shairon.toledo

Re: Solr / Tika Integration

Posted by Dirk Högemann <di...@googlemail.com>.

Interestingly, the only tool I found that handles my PDF correctly is
pdftotext.



Re: Solr / Tika Integration

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Feb 10, 2012 at 6:18 AM, Dirk Högemann
<di...@googlemail.com> wrote:
>
> Our suggest component and parts of our search are becoming hard to use
> because of this. Any other ideas?
>

Looks like https://issues.apache.org/jira/browse/PDFBOX-371

The title of the issue is a bit confusing (I don't think the soft hyphen
should be mapped to a hyphen either!), but I think it's the reason it's being
mapped to a space.
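
As a rough illustration of the difference, here is a small Java sketch using
only the standard library. It is an assumption-laden example, not code from
PDFBox or Tika: the class and method names are made up. It shows that if the
soft hyphen (U+00AD) survives extraction, it can simply be stripped before
indexing, whereas once it has been mapped to a space the result looks like
ordinary word breaks.

```java
final class SoftHyphenDemo {

    /** U+00AD SOFT HYPHEN: an invisible hint marking where a word may be broken. */
    static final char SOFT_HYPHEN = '\u00AD';

    /** Removes soft hyphens so that hyphenation hints do not split index terms. */
    public static String stripSoftHyphens(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "");
    }

    /** Simulates the problematic mapping: every soft hyphen becomes a space. */
    public static String mapSoftHyphensToSpace(String text) {
        return text.replace(SOFT_HYPHEN, ' ');
    }

    public static void main(String[] args) {
        String raw = "Me\u00ADdi\u00ADzin";
        System.out.println(stripSoftHyphens(raw));      // Medizin
        System.out.println(mapSoftHyphensToSpace(raw)); // Me di zin
    }
}
```

After the second mapping there is no reliable way to tell "Me di zin" apart
from three real words, so a clean-up step of this kind only helps if it runs
before, or instead of, the mapping to a space.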

-- 
lucidimagination.com

Re: Solr / Tika Integration

Posted by Dirk Högemann <di...@googlemail.com>.

Thanks so far. I will have a closer look at the PDF.

I tried the enableAutoSpace setting with PDFBox 1.6, but it did not work:

PDFParser parser = new PDFParser();
parser.setEnableAutoSpace(false);
ContentHandler handler = new BodyContentHandler();

Output:
Va ri an te Creutz feldt-
Ja kob-Krank heit
Stel lung nah men des Ar beits krei ses Blut

Our suggest component and parts of our search are becoming hard to use
because of this. Any other ideas?

Best
Dirk



Re: Solr / Tika Integration

Posted by Jan Høydahl <ja...@cominvent.com>.

I think you need to control the "enableAutoSpace" parameter in PDFBox. There's a JIRA issue for it, but as far as I can understand it depends on some Tika 1.1 changes:

https://issues.apache.org/jira/browse/SOLR-2930

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
