Posted to solr-user@lucene.apache.org by Soumitra Banerjee <so...@gmail.com> on 2011/12/07 21:50:55 UTC

Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

All -

I am using SOLR 3.5, SOLRNet 0.4.0.2001, and Tomcat 7.0, and am running a job
to extract the text from PDFs stored on my local hard disk.

*Tomcat StdErr log Shows:*

INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\10310.pdf&extractFormat=text&version=2.2}
status=0 QTime=125
Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 141
Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:XXX\10311.pdf&extractFormat=text&version=2.2}
status=0 QTime=141
Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 125
Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\3M_US_EN_10313.pdf&extractFormat=text&version=2.2}
status=0 QTime=125

*Catalina Log Shows:*
INFO: {} 0 281
Dec 7, 2011 12:29:04 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\11511.pdf&extractFormat=text&version=2.2}
status=0 QTime=281
Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 391
Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:XXX\_11513.pdf&extractFormat=text&version=2.2}
status=0 QTime=391
Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 328
Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&
literal.id=*&resource.name=C:\XXX\11514.pdf&extractFormat=text&version=2.2}
status=0 QTime=328
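
As a rough way to quantify the timings in the two excerpts above, the QTime values (reported by Solr in milliseconds) can be pulled out and averaged. This is only a sketch: the sample string copies the six status lines shown above, and the helper name `qtimes` is made up for illustration.

```python
import re
from statistics import mean

# The six "status=0 QTime=..." lines copied from the log excerpts above.
LOG_EXCERPT = """\
status=0 QTime=125
status=0 QTime=141
status=0 QTime=125
status=0 QTime=281
status=0 QTime=391
status=0 QTime=328
"""


def qtimes(log_text):
    """Return every QTime value (milliseconds) found in the log text."""
    return [int(value) for value in re.findall(r"QTime=(\d+)", log_text)]


values = qtimes(LOG_EXCERPT)
average_ms = mean(values)
print(f"{len(values)} requests, mean QTime {average_ms:.1f} ms, max {max(values)} ms")
```

Running the same helper over a complete log file, rather than a six-line sample, would give a more representative per-document figure.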

The average PDF file size is around 50 KB. My questions are as follows:

1. Can I improve performance by changing any configuration file
(SolrConfig, Tomcat, or others)?
2. Since I am using:

var response = solr.Extract(new ExtractParameters(pdffile, "*"));

from SOLRNet 0.4.0.2001, which just came out (beta), is this a known issue
to be fixed in upcoming versions?


Any help or pointers from the experts will be highly appreciated. Also let me
know if you need additional information; I will be more than happy
to provide it.

Regards, Soumitra

Re: Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

Posted by Soumitra Banerjee <so...@gmail.com>.

Thanks for the response. I will set the stream type accordingly. As for
extracting the text from the PDF, I want the entire content of the PDF. This
content will be part of a SOLR document, which has a unique ID.

What is the unique key for? Here's my schema:

  <fields>
    <field name="InternalCheckID" type="string" indexed="true" stored="true" required="true" />
    <field name="ProductName" type="text" indexed="true" stored="false" required="false" />
    <field name="ProductID" type="text" indexed="true" stored="false" required="false" />
    <field name="Manufacturer" type="text" indexed="true" stored="false" required="false" />
    <field name="RevisionDate" type="date" indexed="true" stored="false" required="false"/>
    <field name="FilePath" type="text" indexed="true" stored="false" />
    <field name="Content" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>
  <uniqueKey>InternalCheckID</uniqueKey>
  <defaultSearchField>Content</defaultSearchField>

Thanks for your help as always.

Regards, Soumitra


On Wed, Dec 7, 2011 at 3:06 PM, Mauricio Scheffer <
mauricioscheffer@gmail.com> wrote:

> Try setting the StreamType to application/pdf, that way Tika doesn't have
> to infer it.
> BTW the second argument to ExtractParameters is the unique key... a value
> of "*" probably doesn't make sense.
>
> --
> Mauricio

Re: Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

Posted by Mauricio Scheffer <ma...@gmail.com>.

Try setting the StreamType to application/pdf, that way Tika doesn't have
to infer it.
BTW the second argument to ExtractParameters is the unique key... a value
of "*" probably doesn't make sense.

--
Mauricio
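
To make the two suggestions above concrete without depending on SOLRNet, here is a minimal Python sketch of the kind of HTTP request that ends up at /update/extract. The query parameters are the ones visible in the logs earlier in the thread. Everything else is an assumption made for illustration: the host and port, the function name `build_extract_request`, the sample ID, and the choice to declare the PDF type through the Content-Type header. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, pdf_bytes, doc_id, resource_name):
    """Build (but do not send) an extract-only request for one PDF."""
    params = urlencode({
        "extractOnly": "true",           # return the text instead of indexing it
        "extractFormat": "text",
        "literal.id": doc_id,            # a real per-document key, not "*"
        "resource.name": resource_name,
    })
    return Request(
        url=f"{base_url}/update/extract?{params}",
        data=pdf_bytes,
        # Declaring the type up front spares the server from detecting it.
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Hypothetical host and port; "/Solr" and "core1" are taken from the logs.
req = build_extract_request(
    "http://localhost:8080/Solr/core1",
    b"%PDF-1.4 placeholder bytes",
    doc_id="10310",
    resource_name="10310.pdf",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server. The point of the sketch is only that each document carries its own ID and an explicit content type.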

