Posted to solr-user@lucene.apache.org by naryad <na...@yahoo.com> on 2012/12/20 05:50:03 UTC

Improving the speed of Solr query over 16 million tweets

I use Solr (SolrCloud) to index and search my tweets. There are about 16
million tweets and the index size is approximately 3 GB. The tweets are
indexed in real time as they arrive, so real-time search is enabled.
Currently I use the lowercase field type for my tweet body field. A single
search term takes around 7 seconds, and search time increases roughly
linearly with each additional term. 3 GB is the maximum RAM allocated to
the Solr process. A sample search query looks like this:

tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data*
Any suggestions on improving the speed of searching? Currently I run only
one shard, which contains the entire tweet collection. I am not sure
whether redeclaring the field as text_en and reindexing everything is the
only option. As far as I can tell, the query is currently scanning all the
documents.



--
View this message in context: http://lucene.472066.n3.nabble.com/Improving-the-speed-of-Solr-query-over-16-million-tweets-tp4028222.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Improving the speed of Solr query over 16 million tweets

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I do not believe there is any way to change the field type without
reindexing. After all, the field has already been stored as one long
string in Lucene.

I think reindexing (probably just deleting or renaming the data directory
and indexing again) is the easiest way. Experiment on a test machine/core
first if you want to find the most suitable type/tokenizer.
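
If you instead want to clear the index by deleting documents, this can be
done with a delete-by-query update. A sketch only; the host, port, and
update endpoint below assume a default single-core setup:

```xml
<!-- Hypothetical example: POST this body to a URL like
     http://localhost:8983/solr/update?commit=true (host/port assumed) -->
<delete><query>*:*</query></delete>
```

After the commit, the index is empty and you can reindex with the new
field type.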

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Dec 20, 2012 at 4:21 PM, naryad <na...@yahoo.com> wrote:

> You are completely right; I realized this. So is the only way to fix
> this to redeclare the field as text_en or text_en_splitting, and then
> delete all the documents and recreate the index? Or is there an easier
> way?

Re: Improving the speed of Solr query over 16 million tweets

Posted by naryad <na...@yahoo.com>.
You are completely right; I realized this. So is the only way to fix this
to redeclare the field as text_en or text_en_splitting, and then delete
all the documents and recreate the index? Or is there an easier way?




Re: Improving the speed of Solr query over 16 million tweets

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
KeywordTokenizerFactory does not actually break the text into tokens; it
makes the whole field a single token.

Is that what you actually want? I would have thought that at least the
tweet body should be broken into words/tokens.
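
For comparison, a field type that does tokenize might look like the
sketch below. This is not from the thread; the name and analyzer chain
are illustrative and would need tuning to the actual data:

```xml
<!-- Hypothetical tokenized type: splits on word boundaries and lowercases,
     so a query like tweet_body:big can match without leading wildcards -->
<fieldType name="text_tokenized" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```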

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Dec 20, 2012 at 4:07 PM, naryad <na...@yahoo.com> wrote:

> Field type of the field is
> <fieldType name="lowercase" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory" />
>       </analyzer>
>     </fieldType>
>
> I will be adding more shards over time. In fact we had two but one went
> down.
>
> This is the requirement: the field tweet_body should contain both big
> and data anywhere in its value, so I am using wildcards for this.

Re: Improving the speed of Solr query over 16 million tweets

Posted by naryad <na...@yahoo.com>.
Field type of the field is 
<fieldType name="lowercase" class="solr.TextField"
positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

I will be adding more shards over time. In fact we had two but one went
down. 

This is the requirement: the field tweet_body should contain both big and
data anywhere in its value, so I am using wildcards for this.




Re: Improving the speed of Solr query over 16 million tweets

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Naryad,

What is the field type of your tweet_* fields? Can you show a sample of
their terms from the Solr Admin UI?
SolrCloud is mostly intended for sharding, not for a single solid index.
Are the asterisks wildcards, or do you mean double quotes?
To resolve most of these questions, you can show the response with
debugQuery=on.
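
For example, a debug request might look like this (the host, port, and
collection name are assumptions, not taken from the thread):

```
http://localhost:8983/solr/collection1/select?q=tweet_body:*big*&debugQuery=on&wt=xml
```

The debug section of the response shows the parsed query and per-component
timings, which would confirm whether the leading wildcards are the cost.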


On Thu, Dec 20, 2012 at 8:50 AM, naryad <na...@yahoo.com> wrote:

> I use Solr (SolrCloud) to index and search my tweets. There are about 16
> million tweets and the index size is approximately 3 GB. The tweets are
> indexed in real time as they arrive, so real-time search is enabled.
> Currently I use the lowercase field type for my tweet body field. A single
> search term takes around 7 seconds, and search time increases roughly
> linearly with each additional term. 3 GB is the maximum RAM allocated to
> the Solr process. A sample search query looks like this:
>
> tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data*
> Any suggestions on improving the speed of searching? Currently I run only
> one shard, which contains the entire tweet collection. I am not sure
> whether redeclaring the field as text_en and reindexing everything is the
> only option. As far as I can tell, the query is currently scanning all the
> documents.



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>