Posted to solr-user@lucene.apache.org by Dirk Högemann <di...@googlemail.com> on 2012/02/10 11:21:32 UTC
Solr / Tika Integration
Hello,
We use Solr 3.5 and Tika to index a large number of PDFs. The content of those PDFs
is searchable via full-text search.
The indexed terms are also used to generate search suggestions.
Unfortunately, PDFBox seems to insert a space character wherever there is a
soft hyphen in the content of the PDF.
As a result, the extracted text is sometimes very fragmented. For example, the word
Medizin is extracted as Me di zin.
As a consequence, the suggestions are often unusable and the search does not
work as expected.
Does anyone have a suggestion for how to extract the content of PDFs containing
soft hyphens without fragmenting it?
Best
Dirk
Re: Solr / Tika Integration
Posted by Shairon Toledo <sh...@gmail.com>.
Hi,
Maybe the PDF creator tool is not generating "fluid" text. A PDF has
sections defined by objects, e.g. for "Medizin":
20 0 obj
(Medizin)
endobj
However, this can also happen:
20 0 obj
(Me)
endobj
21 0 obj
(di)
endobj
22 0 obj
(zin)
endobj
Note that there are three text objects here; the extraction tool can interpret
them as three words.
Check your PDF file to make sure that it is well-formed.
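
To illustrate why separate text objects need not become separate words, here is a minimal sketch of a gap-based joining heuristic. The Fragment class, the coordinates, and the minGap threshold are all hypothetical and chosen for illustration; this is not PDFBox's actual algorithm, only the general idea of inserting a space where the horizontal gap between two pieces is large enough.

```java
import java.util.ArrayList;
import java.util.List;

final class FragmentJoiner {
    /** A piece of text with its horizontal extent on the page (hypothetical model, not a PDFBox type). */
    static final class Fragment {
        final String text;
        final double startX;
        final double endX;

        Fragment(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /** Joins fragments left to right, inserting a space only where the gap exceeds minGap. */
    static String join(List<Fragment> fragments, double minGap) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : fragments) {
            if (previous != null && current.startX - previous.endX > minGap) {
                out.append(' ');
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment("Me", 0.0, 10.0));    // touches the next fragment
        line.add(new Fragment("di", 10.0, 18.0));   // touches the next fragment
        line.add(new Fragment("zin", 18.0, 30.0));  // followed by a visible gap
        line.add(new Fragment("Blut", 34.0, 50.0));
        System.out.println(join(line, 2.0)); // prints "Medizin Blut"
    }
}
```

With a threshold that is too small, the same input would come out fragmented, which matches the symptom described in this thread.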
--
[ ]'s
Shairon Toledo
http://www.google.com/profiles/shairon.toledo
Re: Solr / Tika Integration
Posted by Dirk Högemann <di...@googlemail.com>.
Interestingly, the only tool I found that handles my PDF correctly
was pdftotext.
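
For anyone who wants to try the same route from Java, the following is a rough sketch of calling pdftotext as an external process. It assumes that a pdftotext binary is on the PATH and that it accepts "-enc UTF-8" and "-" (write to standard output), as the xpdf and Poppler versions do; the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class PdfToTextRunner {
    /** Builds the command line: "-enc UTF-8" sets the output encoding, "-" sends the text to stdout. */
    static List<String> command(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    /** Runs pdftotext (assumed to be installed and on the PATH) and returns the extracted text. */
    static String extract(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep error messages out of the text
                .start();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = stdout.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextRunner.java <file.pdf>");
            return;
        }
        System.out.print(extract(args[0]));
    }
}
```

The returned string could then be sent to Solr as a plain text field instead of going through Tika's PDF parser, at the cost of an extra process per document.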
Re: Solr / Tika Integration
Posted by Robert Muir <rc...@gmail.com>.
On Fri, Feb 10, 2012 at 6:18 AM, Dirk Högemann
<di...@googlemail.com> wrote:
>
> Our suggest component and parts of our search are becoming hard to use
> because of this. Any other ideas?
>
Looks like https://issues.apache.org/jira/browse/PDFBOX-371
The title of the issue is a bit confusing (I don't think it should be mapped
to a hyphen either!), but I think it's the reason it's being mapped to a
space.
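
If the extraction step can be made to keep the soft hyphen (U+00AD) instead of turning it into a space, removing it before indexing is straightforward. Below is a minimal sketch; the class name is hypothetical, and it only helps when U+00AD actually survives in the extracted text. Once the character has been replaced by a space, the word boundary can no longer be recovered this way.

```java
final class SoftHyphenCleaner {
    // U+00AD SOFT HYPHEN: an invisible hint marking where a word may be broken.
    private static final String SOFT_HYPHEN = "\u00AD";

    /** Removes every soft hyphen; ordinary hyphens and spaces are left untouched. */
    static String strip(String text) {
        return text.replace(SOFT_HYPHEN, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Me\u00ADdi\u00ADzin")); // prints "Medizin"
    }
}
```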
--
lucidimagination.com
Re: Solr / Tika Integration
Posted by Dirk Högemann <di...@googlemail.com>.
Thanks so far. I will have a closer look at the PDF.
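I tried the enableAutoSpace setting with PDFBox 1.6, but it did not work:
import org.apache.tika.parser.pdf.PDFParser; // assumed: Tika's PDFParser, which has this setter
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

PDFParser parser = new PDFParser();
parser.setEnableAutoSpace(false);
ContentHandler handler = new BodyContentHandler();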
Output:
Va ri an te Creutz feldt-
Ja kob-Krank heit
Stel lung nah men des Ar beits krei ses Blut
Our suggest component and parts of our search are becoming hard to use
because of this. Any other ideas?
Best
Dirk
Re: Solr / Tika Integration
Posted by Jan Høydahl <ja...@cominvent.com>.
I think you need to control the parameter "enableAutoSpace" in PDFBox. There's a JIRA issue for it, but it depends on some Tika 1.1 changes, as far as I can understand:
https://issues.apache.org/jira/browse/SOLR-2930
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com