Posted to solr-user@lucene.apache.org by Ga...@sungard.com on 2015/01/06 18:30:36 UTC

PDF search functionality using Solr

Hello Solr-users and developers,
Can you please suggest,

1.       What should I do to index PDF content information column-wise?

2.       Do I need to extract the contents using some combination of Analyzer, Tokenizer, and Filter and then add it to the index? How can I test the results at the command prompt? I do not know which specific Analyzer, Tokenizer, and Filter to select for this purpose.

3.       How can I verify that the needed column info is extracted out of the PDF and is indexed?

4.       So, for example, how do I verify that the ticket number is extracted into a Ticket_number field and is indexed?

5.       Is it OK to post 4 GB worth of PDFs to be imported and indexed by Solr? I think I saw some posts complaining about the maximum size that can be posted.

6.       What will enable Solr to search across many PDFs for different words such as "Runtime", "Error", "XXXX", and return a link to the matching PDF?

My PDFs are nothing but exports from a Jira ticket system.
Each PDF has info on:
Ticket Number:
Desc:
Client:
Status:
Submitter:
And so on:


1.       I imported a PDF document into Solr and it does the necessary searching; I can test some of it using the provided browse client interface.

2.       I have 80 GB worth of PDFs.

3.       The total number of PDFs is about 200.

4.       Many PDFs are 4 GB in size.

5.       How do you suggest I import such large PDFs? What tools can you suggest to extract the PDF contents into some XML format first and then post that XML to be indexed by Solr?







Your early response is much appreciated.



Thanks

G


Re: PDF search functionality using Solr

Posted by Erick Erickson <er...@gmail.com>.
Seconding Jürgen's comment. 4G docs are almost, but not quite totally,
useless to search. How many JIRAs are in each? That's _one_ document unless
you do some fancy dancing. Pulling the data directly using the JIRA
API sounds far superior.

If you _must_ use the JIRA->PDF->Solr option, consider the following:
use Tika on the client to parse the doc, taking control of the
mapping of the meta-data
and, probably, breaking things up into individual documents, one Solr
document per JIRA.

That'll give you a chance to deal with charset issues and the like.
Here's an example:

https://lucidworks.com/blog/indexing-with-solrj/

That one has both Tika and database connectivity but should be pretty
straightforward to adapt; just pull the database junk out.
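The "one Solr document per JIRA" split can happen on the text Tika hands back, before anything is sent to Solr. A rough sketch in Python (the field labels and the "Ticket Number:" delimiter are assumptions based on the PDF layout described in the original question, not something Tika detects for you):

```python
import re

def split_tickets(extracted_text):
    """Split the plain text extracted from one big PDF into per-ticket
    dicts, assuming each ticket starts with 'Ticket Number:' and fields
    appear as 'Label: value' lines."""
    docs = []
    current = None
    for line in extracted_text.splitlines():
        m = re.match(r"^(Ticket Number|Desc|Client|Status|Submitter):\s*(.*)$", line)
        if not m:
            continue
        label, value = m.group(1), m.group(2)
        if label == "Ticket Number":
            # A new ticket begins: flush the previous one.
            if current:
                docs.append(current)
            current = {}
        if current is not None:
            current[label.lower().replace(" ", "_")] = value
    if current:
        docs.append(current)
    return docs

text = """Ticket Number: ABC-1
Desc: Runtime error in batch job
Status: Open
Ticket Number: ABC-2
Desc: Login failure
Status: Closed"""
docs = split_tickets(text)
# Each dict becomes one Solr document, e.g. {'ticket_number': 'ABC-1', ...}
```

Each resulting dict maps cleanly onto one SolrInputDocument in the SolrJ example linked above.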

Best,
Erick

On Tue, Jan 6, 2015 at 9:55 AM, "Jürgen Wagner (DVT)"
<ju...@devoteam.com> wrote:
> Hello,
>   no matter which search platform you will use, this will pose two
> challenges:
>
> - The size of the documents will render search less and less useful as the
> likelihood of matches increases with document size. So, without a proper
> semantic extraction (e.g., using decent NER or relationship extraction with
> a commercial text mining product), I doubt you will get the required
> precision to make this overly useful.
>
> - PDFs can have their own character sets based on the characters actually
> used. Such file-specific character sets are almost impossible to parse,
> i.e., if your PDFs happen to use this "feature" of the PDF format, you won't
> have much luck getting any meaningful text out of them.
>
> My suggestion is to use the Jira REST API to collect all necessary documents
> and index the resulting XML or attachment formats. As the REST API provides
> filtering capabilities, you could easily create incremental feeds to avoid
> humongous indexing every time there's new information in Jira. Dumping Jira
> stuff as PDF seems to me to be the least suitable way of handling this.
>
> Best regards,
> --Jürgen

Re: PDF search functionality using Solr

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.
Hello,
  no matter which search platform you will use, this will pose two
challenges:

- The size of the documents will render search less and less useful as
the likelihood of matches increases with document size. So, without a
proper semantic extraction (e.g., using decent NER or relationship
extraction with a commercial text mining product), I doubt you will get
the required precision to make this overly useful.

- PDFs can have their own character sets based on the characters
actually used. Such file-specific character sets are almost impossible
to parse, i.e., if your PDFs happen to use this "feature" of the PDF
format, you won't have much luck getting any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary
documents and index the resulting XML or attachment formats. As the REST
API provides filtering capabilities, you could easily create incremental
feeds to avoid humongous indexing every time there's new information in
Jira. Dumping Jira stuff as PDF seems to me to be the least suitable way
of handling this.
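The incremental-feed idea can be sketched in a few lines. The endpoint path and parameters (`jql`, `startAt`, `maxResults`, `fields`) come from Jira's REST search API; the base URL, field list, and timestamp below are purely illustrative:

```python
from urllib.parse import urlencode

def build_search_url(base_url, updated_since, start_at=0, max_results=50):
    """Build a Jira REST search URL that fetches only issues updated
    since the last indexing run (an incremental feed)."""
    jql = 'updated >= "%s" ORDER BY updated ASC' % updated_since
    params = urlencode({
        "jql": jql,
        "startAt": start_at,          # page through results in batches
        "maxResults": max_results,
        "fields": "key,summary,description,status,reporter",
    })
    return "%s/rest/api/2/search?%s" % (base_url.rstrip("/"), params)

url = build_search_url("https://jira.example.com", "2015-01-06 18:30")
print(url)
```

Record the latest `updated` timestamp you have indexed, pass it in on the next run, and only the changed issues come back as JSON, which is straightforward to map onto Solr fields.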

Best regards,
--Jürgen


On 06.01.2015 18:30, Ganesh.Yadav@sungard.com wrote:


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com, URL: www.devoteam.de

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071