Posted to solr-user@lucene.apache.org by Naveen Gupta <nk...@gmail.com> on 2011/06/02 09:21:26 UTC

tika and solr 3.1 integration

Hi

I am trying to integrate Solr 3.1 and Tika (the version that ships with it).

When I use curl to index a few documents, I get an error saying the field
attr_meta is unknown. I checked solrconfig.xml and it looks correct to me.

Can you please tell me what I am missing?

I copied all the jars from contrib/extraction/lib into the solr/lib folder,
which sits in the same place as the conf directory.


I am using the request handler that ships with the default configuration:

<requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>





curl "http://dev.grexit.com:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&attr_&fmap.content=attr_content&commit=true" \
  -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf"


<html><head><title>Apache Tomcat/6.0.18 - Error report</title><style><!--H1
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
H2
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
H3
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
BODY
{font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
P
{font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
{color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
</head><body><h1>HTTP Status 400 - ERROR:unknown field 'attr_meta'</h1><HR
size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b>
<u>ERROR:unknown field 'attr_meta'</u></p><p><b>description</b> <u>The
request sent by the client was syntactically incorrect (ERROR:unknown field
'attr_meta').</u></p><HR size="1" noshade="noshade"><h3>Apache
Tomcat/6.0.18</h3></body></html>root@weforpeople:/usr/share/solr1/lib#


Please note: I previously integrated Apache Tika 0.9 with apache-solr-1.4
locally on a Windows machine using Solr Cell, and calling the program worked
fine without any configuration changes.

Thanks
Naveen

Re: Indexes in ramdisk don't show performance improvement?

Posted by Trey Grainger <so...@gmail.com>.
Linux will cache the open index files in RAM (in the filesystem cache)
after their first read, which makes the ramdisk generally useless.
Unless you're processing other files on the box whose total size exceeds
your unused RAM (and thus need to micro-manage what stays in RAM), I
wouldn't recommend using a ramdisk - it's just more to manage.  If you
reboot the box and run a few searches, those first few will likely be
slower until all the index files are cached in memory.  After that
point, performance should be comparable because all files are read out
of RAM from then on.

If Solr caches are enabled and your queries are repetitive, that could
also be contributing to the speed of the repeated queries.  Note that
the above advice assumes your total unused RAM (not allocated to the
JVM or any other process) is greater than the size of your Lucene
index files, which should be a safe assumption considering you're
trying to put the whole index in a ramdisk.
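One way to watch this caching happen on a Linux box (a rough sketch; the index path below is a placeholder, not anyone's actual layout):

```shell
# Rough check of how much file data the kernel is caching.
# INDEX_DIR is an assumed path -- point it at your Lucene index.
INDEX_DIR=${INDEX_DIR:-/var/solr/data/index}

grep '^Cached:' /proc/meminfo          # page cache size before

# Reading the files once pulls them into the page cache as a side effect.
cat "$INDEX_DIR"/* > /dev/null 2>&1 || true

grep '^Cached:' /proc/meminfo          # should grow by roughly the index size
```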

-Trey


On Thu, Jun 2, 2011 at 7:15 PM, Erick Erickson <er...@gmail.com> wrote:
> What I expect is happening is that the Solr caches are effectively making the
> two tests identical, using memory to hold the vital parts of the code in both
> cases (after disk warming on the instance using the local disk). I suspect if
> you measured the first few queries (assuming no auto-warming) you'd see the
> local disk version be slower.
>
> Were you running these tests for curiosity or is running from /dev/shm something
> you're considering for production?
>
> Best
> Erick
>
> On Thu, Jun 2, 2011 at 5:47 PM, Parker Johnson <Pa...@gap.com> wrote:
>>
>> Hey everyone.
>>
>> Been doing some load testing over the past few days. I've been throwing a
>> good bit of load at an instance of solr and have been measuring response
>> time.  We're running a variety of different keyword searches to keep
>> solr's cache on its toes.
>>
>> I'm running two exact same load testing scenarios: one with indexes
>> residing in /dev/shm and another from local disk.  The indexes are about
>> 4.5GB in size.
>>
>> On both tests the response times are the same.  I wasn't expecting that.
>> I do see the java heap size grow when indexes are served from disk (which
>> is expected).  When the indexes are served out of /dev/shm, the java heap
>> stays small.
>>
>> So in general is this consistent behavior?  I don't really see the
>> advantage of serving indexes from /dev/shm.  When the indexes are being
>> served out of ramdisk, is the linux kernel or the memory mapper doing
>> something tricky behind the scenes to use ramdisk in lieu of the java heap?
>>
>> For what it is worth, we are running x_64 rh5.4 on a 12 core 2.27Ghz Xeon
>> system with 48GB ram.
>>
>> Thoughts?
>>
>> -Park
>>
>>
>>
>

Re: Indexes in ramdisk don't show performance improvement?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Park,

I think there is no way initial queries will be the same IF:
* your index in ramfs is really in RAM
* your index in regular FS is not already in RAM due to being previously cached 
(you *did* flush OS cache before the test, right?)
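For reference, flushing the OS cache before a cold-start test looks roughly like this on Linux (needs root; it discards cached file data but nothing on disk):

```shell
# Flush the Linux page cache so the next reads really hit disk.
sync                                   # write dirty pages to disk first
if [ "$(id -u)" -eq 0 ]; then
    echo 3 > /proc/sys/vm/drop_caches  # 3 = page cache + dentries + inodes
else
    echo "run as root to actually drop the caches"
fi
```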

Having said that, if you update your index infrequently and make use of warm up 
queries and cache warming, you are likely to be very fine with the index on 
disk.
For example, we have a customer right now that we helped a bit with 
performance.  They also have lots of RAM, 10M docs in the index, and replicate 
the whole optimized index nightly.  They have 2 servers, each handling about 
1000 requests per minute and their average response time is under 20 ms with 
pre-1.4.1 Solr and lots of facets and fqs (they use Solr not only for search, 
but also navigation).  No ramfs involved, but they have zero disk reads because 
the whole index is cached in memory, so things are fast.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Parker Johnson <Pa...@gap.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Thu, June 2, 2011 9:20:55 PM
> Subject: Re: Indexes in ramdisk don't show performance improvement?
> 
> 
> That's just the thing.  Even the initial queries have similar response
> times as the later ones.  WEIRD!
> 
> I was considering running from /dev/shm in production, but for slaves only
> (master remains on disk).  At this point though, I'm not seeing a benefit
> to ramdisk so I think I'm going back to traditional disk so the indexes
> stay intact after a power cycle.
> 
> Has anyone else seen that indexes served from disk perform similarly as
> indexes served from ramdisk?
> 
> -Park
> 
> On 6/2/11 4:15 PM, "Erick Erickson" <er...@gmail.com> wrote:
> 
> >What I expect is happening is that the Solr caches are  effectively making
> >the
> >two tests identical, using memory to hold  the vital parts of the code in
> >both
> >cases (after disk warming on  the instance using the local disk). I
> >suspect if
> >you measured the  first few queries (assuming no auto-warming) you'd see
> >the
> >local  disk version be slower.
> >
> >Were you running these tests for  curiosity or is running from /dev/shm
> >something
> >you're considering  for production?
> >
> >Best
> >Erick
> >
> >On Thu, Jun 2,  2011 at 5:47 PM, Parker Johnson <Pa...@gap.com>
> >wrote:
> >>
> >>  Hey everyone.
> >>
> >> Been doing some load testing over the  past few days. I've been throwing
> >>a
> >> good bit of load at  an instance of solr and have been measuring response
> >> time.   We're running a variety of different keyword searches to keep
> >> solr's  cache on its toes.
> >>
> >> I'm running two exact same load  testing scenarios: one with indexes
> >> residing in /dev/shm and another  from local disk.  The indexes are about
> >> 4.5GB in  size.
> >>
> >> On both tests the response times are the  same.  I wasn't expecting that.
> >> I do see the java heap size  grow when indexes are served from disk
> >>(which
> >> is  expected).  When the indexes are served out of /dev/shm, the  java
> >>heap
> >> stays small.
> >>
> >> So in  general is this consistent behavior?  I don't really see the
> >>  advantage of serving indexes from /dev/shm.  When the indexes are  being
> >> served out of ramdisk, is the linux kernel or the memory  mapper doing
> >> something tricky behind the scenes to use ramdisk in  lieu of the java
> >>heap?
> >>
> >> For what it is worth,  we are running x_64 rh5.4 on a 12 core 2.27Ghz
> >>Xeon
> >>  system with 48GB ram.
> >>
> >> Thoughts?
> >>
> >>  -Park
> >>
> >>
> >>
> >
> 
> 
> 

Re: Indexes in ramdisk don't show performance improvement?

Posted by Parker Johnson <Pa...@gap.com>.
That's just the thing.  Even the initial queries have similar response
times as the later ones.  WEIRD!

I was considering running from /dev/shm in production, but for slaves only
(master remains on disk).  At this point though, I'm not seeing a benefit
to ramdisk so I think I'm going back to traditional disk so the indexes
stay intact after a power cycle.

Has anyone else seen that indexes served from disk perform similarly as
indexes served from ramdisk?

-Park

On 6/2/11 4:15 PM, "Erick Erickson" <er...@gmail.com> wrote:

>What I expect is happening is that the Solr caches are effectively making
>the
>two tests identical, using memory to hold the vital parts of the code in
>both
>cases (after disk warming on the instance using the local disk). I
>suspect if
>you measured the first few queries (assuming no auto-warming) you'd see
>the
>local disk version be slower.
>
>Were you running these tests for curiosity or is running from /dev/shm
>something
>you're considering for production?
>
>Best
>Erick
>
>On Thu, Jun 2, 2011 at 5:47 PM, Parker Johnson <Pa...@gap.com>
>wrote:
>>
>> Hey everyone.
>>
>> Been doing some load testing over the past few days. I've been throwing
>>a
>> good bit of load at an instance of solr and have been measuring response
>> time.  We're running a variety of different keyword searches to keep
>> solr's cache on its toes.
>>
>> I'm running two exact same load testing scenarios: one with indexes
>> residing in /dev/shm and another from local disk.  The indexes are about
>> 4.5GB in size.
>>
>> On both tests the response times are the same.  I wasn't expecting that.
>> I do see the java heap size grow when indexes are served from disk
>>(which
>> is expected).  When the indexes are served out of /dev/shm, the java
>>heap
>> stays small.
>>
>> So in general is this consistent behavior?  I don't really see the
>> advantage of serving indexes from /dev/shm.  When the indexes are being
>> served out of ramdisk, is the linux kernel or the memory mapper doing
>> something tricky behind the scenes to use ramdisk in lieu of the java
>>heap?
>>
>> For what it is worth, we are running x_64 rh5.4 on a 12 core 2.27Ghz
>>Xeon
>> system with 48GB ram.
>>
>> Thoughts?
>>
>> -Park
>>
>>
>>
>



Re: Indexes in ramdisk don't show performance improvement?

Posted by Erick Erickson <er...@gmail.com>.
What I expect is happening is that the Solr caches are effectively making the
two tests identical, using memory to hold the vital parts of the code in both
cases (after disk warming on the instance using the local disk). I suspect if
you measured the first few queries (assuming no auto-warming) you'd see the
local disk version be slower.

Were you running these tests for curiosity or is running from /dev/shm something
you're considering for production?

Best
Erick

On Thu, Jun 2, 2011 at 5:47 PM, Parker Johnson <Pa...@gap.com> wrote:
>
> Hey everyone.
>
> Been doing some load testing over the past few days. I've been throwing a
> good bit of load at an instance of solr and have been measuring response
> time.  We're running a variety of different keyword searches to keep
> solr's cache on its toes.
>
> I'm running two exact same load testing scenarios: one with indexes
> residing in /dev/shm and another from local disk.  The indexes are about
> 4.5GB in size.
>
> On both tests the response times are the same.  I wasn't expecting that.
> I do see the java heap size grow when indexes are served from disk (which
> is expected).  When the indexes are served out of /dev/shm, the java heap
> stays small.
>
> So in general is this consistent behavior?  I don't really see the
> advantage of serving indexes from /dev/shm.  When the indexes are being
> served out of ramdisk, is the linux kernel or the memory mapper doing
> something tricky behind the scenes to use ramdisk in lieu of the java heap?
>
> For what it is worth, we are running x_64 rh5.4 on a 12 core 2.27Ghz Xeon
> system with 48GB ram.
>
> Thoughts?
>
> -Park
>
>
>

Re: tika and solr 3.1 integration

Posted by Naveen Gupta <nk...@gmail.com>.
Hi

This is fixed. Yes, schema.xml was the culprit; I fixed it by looking at
the sample schema provided with the distribution.

But on Windows I am getting an slf4j error (IllegalAccess exception),
which looks like a jar problem. The slf4j FAQ suggests using version
1.5.5, which is already in the lib folder.

I have had to deploy a lot of jars, and I am afraid that is what is
causing the problem.

Has somebody experienced the same?

Thanks
Naveen


On Fri, Jun 3, 2011 at 2:41 AM, Juan Grande <ju...@gmail.com> wrote:

> Hi Naveen,
>
> Check if there is a dynamic field named "attr_*" in the schema. The
> "uprefix=attr_" parameter means that if Solr can't find an extracted field
> in the schema, it'll add the prefix "attr_" and try again.
>
> *Juan*
>
>
>
> On Thu, Jun 2, 2011 at 4:21 AM, Naveen Gupta <nk...@gmail.com> wrote:
>
> > Hi
> >
> > I am trying to integrate solr 3.1 and tika (which comes default with the
> > version)
> >
> > and using curl command trying to index few of the documents, i am getting
> > this error. the error is attr_meta field is unknown. i checked the
> > solrconfig, it looks perfect to me.
> >
> > can you please tell me what i am missing.
> >
> > I copied all the jars from contrib/extraction/lib to solr/lib folder that
> > is
> > there in same place where conf is there ....
> >
> >
> > I am using the same request handler which is coming with default
> >
> > <requestHandler name="/update/extract"
> >                  startup="lazy"
> >                  class="solr.extraction.ExtractingRequestHandler" >
> >    <lst name="defaults">
> >      <!-- All the main content goes into "text"... if you need to return
> >           the extracted text or do highlighting, use a stored field. -->
> >      <str name="fmap.content">text</str>
> >      <str name="lowernames">true</str>
> >      <str name="uprefix">ignored_</str>
> >
> >      <!-- capture link hrefs but ignore div attributes -->
> >      <str name="captureAttr">true</str>
> >      <str name="fmap.a">links</str>
> >      <str name="fmap.div">ignored_</str>
> >    </lst>
> >  </requestHandler>
> >
> >
> >
> >
> >
> > * curl "
> >
> >
> http://dev.grexit.com:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&attr_&fmap.content=attr_content&commit=true
> > "
> > -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf"*
> >
> >
> > <html><head><title>Apache Tomcat/6.0.18 - Error
> report</title><style><!--H1
> >
> >
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
> > H2
> >
> >
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
> > H3
> >
> >
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
> > BODY
> > {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
> B
> >
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
> > P
> >
> >
> {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
> > {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
> > </head><body><h1>HTTP Status 400 - ERROR:unknown field
> 'attr_meta'</h1><HR
> > size="1" noshade="noshade"><p><b>type</b> Status
> > report</p><p><b>message</b>
> > <u>ERROR:unknown field 'attr_meta'</u></p><p><b>description</b> <u>The
> > request sent by the client was syntactically incorrect (ERROR:unknown
> field
> > 'attr_meta').</u></p><HR size="1" noshade="noshade"><h3>Apache
> > Tomcat/6.0.18</h3></body></html>root@weforpeople:/usr/share/solr1/lib#
> >
> >
> > Please note
> >
> > i integrated apacha tika 0.9 with apache-solr-1.4 locally on windows
> > machine
> > and using solr cell
> >
> > calling the program works fine without any changes in configuration.
> >
> > Thanks
> > Naveen
> >
>

Indexes in ramdisk don't show performance improvement?

Posted by Parker Johnson <Pa...@gap.com>.
Hey everyone.

Been doing some load testing over the past few days. I've been throwing a
good bit of load at an instance of solr and have been measuring response
time.  We're running a variety of different keyword searches to keep
solr's cache on its toes.

I'm running two identical load-testing scenarios: one with the indexes
residing in /dev/shm and the other serving them from local disk.  The
indexes are about 4.5GB in size.
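The ramdisk side of a test like this can be set up roughly as follows; the paths and size here are only illustrative, not the actual setup described above:

```shell
# Sketch: serve a copy of a Solr index from a tmpfs ramdisk.
# tmpfs contents vanish on reboot, so keep the authoritative copy on disk.
RAMDIR=/mnt/solr-ram
INDEX=/var/solr/data/index             # assumed on-disk index location

if [ "$(id -u)" -eq 0 ]; then
    mkdir -p "$RAMDIR"
    mount -t tmpfs -o size=6g tmpfs "$RAMDIR"  # headroom over a ~4.5GB index
    cp -r "$INDEX" "$RAMDIR/index"
    # then point the index directory in solrconfig.xml at $RAMDIR/index
else
    echo "run as root to create the tmpfs mount"
fi
```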

On both tests the response times are the same.  I wasn't expecting that.
I do see the java heap size grow when indexes are served from disk (which
is expected).  When the indexes are served out of /dev/shm, the java heap
stays small.

So in general is this consistent behavior?  I don't really see the
advantage of serving indexes from /dev/shm.  When the indexes are being
served out of ramdisk, is the linux kernel or the memory mapper doing
something tricky behind the scenes to use ramdisk in lieu of the java heap?

For what it is worth, we are running x86_64 RHEL 5.4 on a 12-core 2.27GHz
Xeon system with 48GB RAM.

Thoughts?

-Park



Re: tika and solr 3.1 integration

Posted by Juan Grande <ju...@gmail.com>.
Hi Naveen,

Check if there is a dynamic field named "attr_*" in the schema. The
"uprefix=attr_" parameter means that if Solr can't find an extracted field
in the schema, it'll add the prefix "attr_" and try again.
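A minimal sketch of such a catch-all dynamic field for schema.xml; the "text_general" type name is an assumption here, so substitute a text field type your schema actually defines:

```xml
<!-- catch-all for extracted metadata such as attr_meta -->
<dynamicField name="attr_*" type="text_general"
              indexed="true" stored="true" multiValued="true"/>
```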

*Juan*



On Thu, Jun 2, 2011 at 4:21 AM, Naveen Gupta <nk...@gmail.com> wrote:

> Hi
>
> I am trying to integrate solr 3.1 and tika (which comes default with the
> version)
>
> and using curl command trying to index few of the documents, i am getting
> this error. the error is attr_meta field is unknown. i checked the
> solrconfig, it looks perfect to me.
>
> can you please tell me what i am missing.
>
> I copied all the jars from contrib/extraction/lib to solr/lib folder that
> is
> there in same place where conf is there ....
>
>
> I am using the same request handler which is coming with default
>
> <requestHandler name="/update/extract"
>                  startup="lazy"
>                  class="solr.extraction.ExtractingRequestHandler" >
>    <lst name="defaults">
>      <!-- All the main content goes into "text"... if you need to return
>           the extracted text or do highlighting, use a stored field. -->
>      <str name="fmap.content">text</str>
>      <str name="lowernames">true</str>
>      <str name="uprefix">ignored_</str>
>
>      <!-- capture link hrefs but ignore div attributes -->
>      <str name="captureAttr">true</str>
>      <str name="fmap.a">links</str>
>      <str name="fmap.div">ignored_</str>
>    </lst>
>  </requestHandler>
>
>
>
>
>
> * curl "
>
> http://dev.grexit.com:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&attr_&fmap.content=attr_content&commit=true
> "
> -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf"*
>
>
> <html><head><title>Apache Tomcat/6.0.18 - Error report</title><style><!--H1
>
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
> H2
>
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
> H3
>
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
> BODY
> {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
> P
>
> {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
> {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
> </head><body><h1>HTTP Status 400 - ERROR:unknown field 'attr_meta'</h1><HR
> size="1" noshade="noshade"><p><b>type</b> Status
> report</p><p><b>message</b>
> <u>ERROR:unknown field 'attr_meta'</u></p><p><b>description</b> <u>The
> request sent by the client was syntactically incorrect (ERROR:unknown field
> 'attr_meta').</u></p><HR size="1" noshade="noshade"><h3>Apache
> Tomcat/6.0.18</h3></body></html>root@weforpeople:/usr/share/solr1/lib#
>
>
> Please note
>
> i integrated apacha tika 0.9 with apache-solr-1.4 locally on windows
> machine
> and using solr cell
>
> calling the program works fine without any changes in configuration.
>
> Thanks
> Naveen
>