
Posted to solr-user@lucene.apache.org by JDJ <Ja...@ustranscom.mil> on 2013/03/12 20:21:42 UTC

PDF keyword searches not accurate

Hello, everyone.

I'm working (basically for the first time) on a project that requires PDFs
to be indexed and searched via Solr under ColdFusion Server 9.

I've completed the project, but the client is asking a question that I don't
have the answer for.

Basically, there is one PDF that contains "1386" (part of a form
description) but does not appear in the results when I search for 1386.
The PDF is relatively small (2.7 MB).  Is there a way to troubleshoot this
issue?  I'm a Solr n00b.

Thank you,




-----
JDJ
  "There are two kinds of people in the world;
      those who understand binary, and
      those who don't."
--
View this message in context: http://lucene.472066.n3.nabble.com/PDF-keyword-searches-not-accurate-tp4046741.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: PDF keyword searches not accurate

Posted by JDJ <Ja...@ustranscom.mil>.

Unfortunately, I am not in control of the development environment, so
installing a stand-alone Solr is not an option.

Well, let me correct that: I do have my own instance of ColdFusion Server
on my local machine (sometimes I develop locally, sometimes I develop on the
network), but if I installed a stand-alone Solr, it would not match
production.  Production uses CF's built-in Solr instance, so that's what
I need to code for.  I _could_ suggest that the company install its own
stand-alone Solr, but approval would take months, and the suggestion is
unlikely to be taken seriously.  :)



-----
JDJ

Re: PDF keyword searches not accurate

Posted by Jack Krupansky <ja...@basetechnology.com>.

You could just download Solr and run it by itself (quite easy), sending a
PDF document to the Solr /update/extract handler as per that wiki page.

See:
http://lucene.apache.org/solr/4_2_0/tutorial.html
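
A minimal sketch of sending a PDF to the /update/extract handler in
"extract only" mode, in case it helps. The base URL below is the default
from the Solr tutorial and is an assumption, as are the helper names
build_extract_url and extract_pdf (they are not part of Solr). The raw
response is returned as-is so the extracted text can be inspected by hand.

```python
import urllib.parse
import urllib.request


def build_extract_url(solr_base, extract_only=True):
    """Build the Solr Cell URL; extractOnly=true returns the text without indexing."""
    params = {"extractOnly": "true" if extract_only else "false"}
    return solr_base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def extract_pdf(solr_base, pdf_path, timeout=60):
    """POST the raw PDF bytes to Solr Cell and return the response body as text."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_extract_url(solr_base),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical usage against a local stand-alone Solr (path is a placeholder):
# print(extract_pdf("http://localhost:8983/solr", "form.pdf"))
```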

-- Jack Krupansky

-----Original Message----- 
From: JDJ
Sent: Tuesday, March 12, 2013 5:12 PM
To: solr-user@lucene.apache.org
Subject: Re: PDF keyword searches not accurate

Hello, Jack.  (My name, as well.)

Thank you for the advice and link.  I'll have to see if there is a
ColdFusion equivalent, as I'm not using HTTP to search the collection (CF
has CFSEARCH that will work with Solr collections.)

Much appreciated,




-----
JDJ

Re: PDF keyword searches not accurate

Posted by JDJ <Ja...@ustranscom.mil>.

Hello, Jack.  (My name, as well.)

Thank you for the advice and link.  I'll have to see if there is a
ColdFusion equivalent, as I'm not using HTTP to search the collection (CF
has CFSEARCH that will work with Solr collections.)

Much appreciated,




-----
JDJ

Re: PDF keyword searches not accurate

Posted by JDJ <Ja...@ustranscom.mil>.

Does this make a difference?





-----
JDJ

Re: PDF keyword searches not accurate

Posted by JDJ <Ja...@ustranscom.mil>.

I don't know whether this makes a difference, but I just discovered that
the version of Solr that ships with CF Server 9 is v1.4.0.





-----
JDJ

Re: PDF keyword searches not accurate

Posted by JDJ <Ja...@ustranscom.mil>.

Hello, Michael.

Thank you for your suggestion.  I'm unfamiliar with the analysis handler.
Do you have a link for that?

Much appreciated,




-----
JDJ

Re: PDF keyword searches not accurate

Posted by Michael Della Bitta <mi...@appinions.com>.

You could also use the analysis handler to see if your field
definition strips numeric input.
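
A rough sketch of building such a request, assuming the field analysis
handler is registered at /analysis/field in solrconfig.xml (it may not be
in a ColdFusion-bundled configuration). The base URL and the field name
"contents" are placeholders, and build_field_analysis_url is a hypothetical
helper name. The response lists the tokens produced at each stage of the
field's analysis chain, so you can see whether "1386" survives.

```python
from urllib.parse import urlencode


def build_field_analysis_url(core_base, field_name, sample_text):
    """Build a field-analysis URL that runs sample_text through the field's index-time analyzer."""
    params = {
        "analysis.fieldname": field_name,    # field whose definition is being checked
        "analysis.fieldvalue": sample_text,  # text to analyze as if it were indexed
    }
    return core_base.rstrip("/") + "/analysis/field?" + urlencode(params)


# Placeholder values; open the resulting URL in a browser and inspect the tokens.
url = build_field_analysis_url("http://localhost:8983/solr", "contents", "DD Form 1386")
print(url)
```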

Michael Della Bitta

------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Tue, Mar 12, 2013 at 4:14 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> Use the "extract only" option for Solr Cell to get the text stream that was
> extracted by Solr Cell/Tika/PDFBox, then manually search through the
> response for some text that is near the "1386", and see what text is output
> in the vicinity of the "1386".
>
> See:
> http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
>
> Three possibilities: 1) the "text" is actually a graphic image (e.g., screen
> capture), 2) the "1386" has an embedded space and is split into two or more
> terms, or 3) the "1386" is getting concatenated with an adjacent term.
>
> -- Jack Krupansky

Re: PDF keyword searches not accurate

Posted by Jack Krupansky <ja...@basetechnology.com>.

Use the "extract only" option for Solr Cell to get the text stream that was 
extracted by Solr Cell/Tika/PDFBox, then manually search through the 
response for some text that is near the "1386", and see what text is output 
in the vicinity of the "1386".

See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only

Three possibilities: 1) the "text" is actually a graphic image (e.g., screen
capture), 2) the "1386" has an embedded space and is split into two or more 
terms, or 3) the "1386" is getting concatenated with an adjacent term.
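
Once the extracted text is in hand, the three possibilities can be checked
mechanically. The following is only a sketch: diagnose_term is a
hypothetical helper, and it works on whatever plain text you pulled out of
the extract-only response. If all three lists come back empty, the term is
not in the extracted text at all, which points toward possibility 1.

```python
import re


def diagnose_term(extracted_text, term, window=30):
    """Report how `term` appears in text extracted from a document."""
    findings = {"exact": [], "split": [], "concatenated": []}

    # Term present as a standalone token: record some surrounding context.
    standalone = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    for match in re.finditer(standalone, extracted_text):
        start = max(0, match.start() - window)
        findings["exact"].append(extracted_text[start:match.end() + window])

    # Possibility 2: the characters of the term are separated by whitespace.
    split_pattern = r"\s*".join(re.escape(ch) for ch in term)
    for match in re.finditer(split_pattern, extracted_text):
        if match.group() != term:
            findings["split"].append(match.group())

    # Possibility 3: the term is glued to adjacent word characters.
    for match in re.finditer(r"\w*" + re.escape(term) + r"\w*", extracted_text):
        if match.group() != term:
            findings["concatenated"].append(match.group())

    return findings


# Made-up sample text, for illustration only.
print(diagnose_term("Complete DD Form 13 86 and Form1386A today.", "1386"))
```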

-- Jack Krupansky
