Posted to solr-user@lucene.apache.org by Naveen Gupta <nk...@gmail.com> on 2011/06/06 07:54:39 UTC

TIKA INTEGRATION PERFORMANCE

Hi

Since our application is in PHP, we are using solphp to make cURL-based calls.

My concern here is that for each user we might have 20-40 attachments that
need to be indexed each day, and there are many users. Daily we are targeting
around 500-1000 users.

Right now we do something like this:

<?php
// Post the file to Solr's ExtractingRequestHandler; Tika extracts the content inside Solr.
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@paper.pdf'));
$result = curl_exec($ch);
curl_close($ch);
?>

We are also planning to add other fields which are to be indexed and stored.


There are a couple of questions here:

1. What would be the best commit strategy? If we put all the documents in an
array, iterate over them one by one firing the cURL request, and only commit
for the last doc, will that work, or do we need to commit for each doc?

2. We have several fields already defined in the schema, and a few of them are
required for an earlier use case but not for this one. How can we support both
requirements in the same schema?

3. Since commits are frequent, how can we use Solr multicore to separate write
and read operations?

Thanks
Naveen

Re: TIKA INTEGRATION PERFORMANCE

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
On Mon, Jun 6, 2011 at 1:47 PM, Naveen Gupta <nk...@gmail.com> wrote:

> Hi Tomas,
>
> 1. Regarding SolrInputDocument,
>
> We are not using the Java client; we are using the PHP Solr client. I am
> not sure how to wrap content in a SolrInputDocument from the PHP client. In
> that case we would need the Tika-related jars to get at metadata such as
> the content, and we certainly don't want to handle all of that in the PHP
> client.
>

I don't understand. Tika IS integrated into Solr; it doesn't matter which
client or client language you are using. To add a static value, all you have
to do is pass it as a request parameter with the prefix "literal", something
like "literal.somefield=thevalue". The content and other file metadata such as
the author (see
http://wiki.apache.org/solr/ExtractingRequestHandler#Metadata) will be added
to the document inside Solr and indexed. You don't need to handle any of this
in the client application.
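
For example, a request along these lines would set static values alongside the
extracted content (just a rough sketch; the field names "author_s" and
"category_s" are placeholders for whatever your schema actually defines):

<?php
// Hypothetical sketch: send a file plus two literal (static) field values.
$url = 'http://localhost:8010/solr/update/extract'
     . '?literal.id=doc2'
     . '&literal.author_s=' . urlencode('Naveen Gupta')
     . '&literal.category_s=' . urlencode('attachments');

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // capture Solr's response
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@paper.pdf'));
$result = curl_exec($ch);
curl_close($ch);
?>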

>
>  Secondly, what I was asking about the commit strategy:
>
> Suppose you have 100 docs.
>
> Iterate over the first 99 docs and fire cURL without commit in the URL,
>
> and for the 100th doc, use commit=true.
>
> Doing so, will it also update the indexes for the first 99 docs?
>
> while (i <= 99) {
>     curl_command = url without commit;
> }
>
> when i = 100, the URL would include commit
>

You can certainly do this. The 100 documents will be available for search
after the commit. None of the documents will be available for search before
the commit.
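
Something along these lines (an untested sketch; the file paths and IDs are
placeholders) illustrates the idea of posting every file without a commit and
then issuing a single explicit commit at the end:

<?php
// Sketch: index every attachment without committing, then commit once.
$files = array('doc1' => '/tmp/paper1.pdf', 'doc2' => '/tmp/paper2.pdf');

foreach ($files as $id => $path) {
    $ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=' . urlencode($id));
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
    curl_exec($ch);
    curl_close($ch);
}

// One explicit commit makes all of the documents above visible to searches.
$ch = curl_init('http://localhost:8010/solr/update');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_setopt($ch, CURLOPT_POSTFIELDS, '<commit/>');
curl_exec($ch);
curl_close($ch);
?>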

>
> I wanted to achieve something similar to an optimize kind of thing.
>

The optimize command should be issued when not many queries or updates are
being sent to the index. It uses a lot of resources and will slow down queries
while it runs.
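
If you do want to trigger an optimize explicitly (sparingly, e.g. off-peak), a
call like this should work, assuming the same host and port as your example:

<?php
// Sketch: post an explicit <optimize/> command to the update handler.
$ch = curl_init('http://localhost:8010/solr/update');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_setopt($ch, CURLOPT_POSTFIELDS, '<optimize/>');
curl_exec($ch);
curl_close($ch);
?>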

>
> Why aren't these kinds of general-purpose use cases included in the examples
> (especially for other languages; Java guys can easily do this using the API)?
>

They are; you can use the auto-commit feature, configured in the
solrconfig.xml file. You can either tell Solr to commit after a time interval
or after a certain number of documents have been updated but not yet
committed. In the example solrconfig.xml the autoCommit section is commented
out, but you can uncomment it.
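
For reference, the section looks roughly like this (the values below are only
illustrative; it lives inside the <updateHandler> section of solrconfig.xml):

<autoCommit>
  <maxDocs>100</maxDocs>    <!-- commit after 100 uncommitted documents -->
  <maxTime>60000</maxTime>  <!-- or after 60 seconds, whichever comes first -->
</autoCommit>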


> I am basically a Java guy, so I can feel the problem.
>
> Thanks
> Naveen

Re: TIKA INTEGRATION PERFORMANCE

Posted by Naveen Gupta <nk...@gmail.com>.
Hi Tomas,

1. Regarding SolrInputDocument,

We are not using the Java client; we are using the PHP Solr client. I am not
sure how to wrap content in a SolrInputDocument from the PHP client. In that
case we would need the Tika-related jars to get at metadata such as the
content, and we certainly don't want to handle all of that in the PHP client.

Secondly, what I was asking about the commit strategy:

Suppose you have 100 docs.

Iterate over the first 99 docs and fire cURL without commit in the URL,

and for the 100th doc, use commit=true.

Doing so, will it also update the indexes for the first 99 docs?

while (i <= 99) {
     curl_command = url without commit;
}

when i = 100, the URL would include commit

I wanted to achieve something similar to an optimize kind of thing.

Why aren't these kinds of general-purpose use cases included in the examples
(especially for other languages; Java guys can easily do this using the API)?

I am basically a Java guy, so I can feel the problem.

Thanks
Naveen

Re: TIKA INTEGRATION PERFORMANCE

Posted by Tomás Fernández Löbbe <to...@gmail.com>.
1. About the commit strategy: all that the ExtractingRequestHandler (the
request handler that uses Tika to extract content from the input file) does is
extract the content of your file and add it to a SolrInputDocument. The commit
strategy should not change because of this, compared to any other documents
you might be indexing. It is usually not recommended to commit on every new or
updated document.

2. I'm not sure I understand the question. You can add all the static fields
you want to the document by adding the "literal." prefix to the names of the
fields when using the ExtractingRequestHandler (as you are doing with
"literal.id"). You can also leave fields empty if they are not marked as
"required" in the schema.xml file. See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
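
For example, in schema.xml only the fields marked required="true" must be
present in every document; the others can simply be omitted per document (the
field names below are just illustrative, not from your schema):

<field name="id"       type="string" indexed="true" stored="true" required="true"/>
<field name="content"  type="text"   indexed="true" stored="true"/>
<field name="author_s" type="string" indexed="true" stored="true"/>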

3. Solr cores can work almost like completely separate Solr instances. You
could tell one core to replicate from another core, but I don't think that
would be of any help here. If you want to separate the indexing operations
from the query operations, you could probably use different machines; that's
usually a better option. Configure the indexing box as master and the query
box as slave. Here is some more information about it:
http://wiki.apache.org/solr/SolrReplication
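
A minimal sketch of that setup in solrconfig.xml might look like the following
(host name, port, and poll interval are just placeholders for your
environment):

On the master (indexing) box:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

On the slave (query) box:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://indexing-box:8010/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>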

Were these the answers you were looking for, or did I misunderstand your
questions?

Tomás
