Posted to solr-user@lucene.apache.org by ben boggess <be...@gmail.com> on 2010/10/20 21:23:36 UTC

Multiple indexes inside a single core

We are trying to convert a Lucene-based search solution to a
Solr/Lucene-based solution.  The problem we have is that we currently have
our data split into many indexes and Solr expects things to be in a single
index unless you're sharding.  In addition to this, our indexes wouldn't
work well using the distributed search functionality in Solr because the
documents are not evenly or randomly distributed.  We are currently using
Lucene's MultiSearcher to search over subsets of these indexes.
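
For reference, this is roughly what we do today (a minimal sketch against the
Lucene 3.x API; fastDir, slowDir and query are placeholders):

IndexSearcher fast = new IndexSearcher(IndexReader.open(fastDir));
IndexSearcher slow = new IndexSearcher(IndexReader.open(slowDir));
// MultiSearcher merges hits and computes docFreqs across all sub-searchers
MultiSearcher searcher = new MultiSearcher(fast, slow);
TopDocs hits = searcher.search(query, 10);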

I know this has been brought up a number of times in previous posts and the
typical response is that the best thing to do is to convert everything into
a single index.  One of the major reasons for having the indexes split up
the way we do is because different types of data need to be indexed at
different intervals.  You may need one index to be updated every 20 minutes
and another is only updated every week.  If we move to a single index, then
we will constantly be warming and replacing searchers for the entire
dataset, and will essentially render the searcher caches useless.  If we
were able to have multiple indexes, they would each have a searcher and
updates would be isolated to a subset of the data.

The other problem is that we will likely need to shard this large single
index and there isn't a clean way to shard randomly and evenly across all of
the data.  We would, however, like to shard a single data type.  If we could
use multiple indexes, we would likely be also sharding a small sub-set of
them.

Thanks in advance,

Ben

FieldCache

Posted by Mathias Walter <ma...@gmx.net>.
Hi,

does a field which should be cached need to be indexed?

I have a binary field which is just stored. Retrieving it via FieldCache.DEFAULT.getTerms returns empty BytesRefs.

Then I found the following post: http://www.mail-archive.com/dev@lucene.apache.org/msg05403.html

How can I use the FieldCache with a binary field?
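
For illustration, this is roughly the call I'm making (against trunk; reader
and docID stand in for my actual variables):

FieldCache.DocTerms terms = FieldCache.DEFAULT.getTerms(reader, "id");
BytesRef ref = new BytesRef();
terms.getTerm(docID, ref); // ref always comes back with length 0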

--
Kind regards,
Mathias


Re: Multiple indexes inside a single core

Posted by Valli Indraganti <va...@gmail.com>.
Here's the Jira issue for the distributed search problem:
https://issues.apache.org/jira/browse/SOLR-1632

I tried applying this patch but get the same error that is posted in the
discussion section for that issue. I will be glad to help on this one too.

On Sat, Oct 23, 2010 at 2:35 PM, Erick Erickson <er...@gmail.com> wrote:

> Ah, I should have read more carefully...
>
> I remember this being discussed on the dev list, and I thought there might
> be
> a Jira attached but I sure can't find it.
>
> If you're willing to work on it, you might hop over to the solr dev list
> and
> start
> a discussion, maybe ask for a place to start. I'm sure some of the devs
> have
> thought about this...
>
> If nobody on the dev list says "There's already a JIRA on it", then you
> should
> open one. The Jira issues are generally preferred when you start getting
> into
> design because the comments are preserved for the next person who tries
> the idea or makes changes, etc....
>
> Best
> Erick
>
> On Wed, Oct 20, 2010 at 9:52 PM, Ben Boggess <be...@gmail.com>
> wrote:
>
> > Thanks Erick.  The problem with multiple cores is that the documents are
> > scored independently in each core.  I would like to be able to search
> across
> > both cores and have the scores 'normalized' in a way that's similar to
> what
> > Lucene's MultiSearcher would do.  As far as I understand, multiple cores
> > would likely result in seriously skewed scores in my case since the
> > documents are not distributed evenly or randomly.  I could have one
> > core/index with 20 million docs and another with 200.
> >
> > I've poked around in the code and this feature doesn't seem to exist.  I
> > would be happy with finding a decent place to try to add it.  I'm not
> sure
> > if there is a clean place for it.
> >
> > Ben
> >
> > On Oct 20, 2010, at 8:36 PM, Erick Erickson <er...@gmail.com>
> > wrote:
> >
> > > It seems to me that multiple cores are along the lines you
> > > need, a single instance of Solr that can search across multiple
> > > sub-indexes that do not necessarily share schemas, and are
> > > independently maintainable......
> > >
> > > This might be a good place to start:
> > http://wiki.apache.org/solr/CoreAdmin
> > >
> > > HTH
> > > Erick
> > >
> > > On Wed, Oct 20, 2010 at 3:23 PM, ben boggess <be...@gmail.com>
> > wrote:
> > >
> > >> We are trying to convert a Lucene-based search solution to a
> > >> Solr/Lucene-based solution.  The problem we have is that we currently
> > have
> > >> our data split into many indexes and Solr expects things to be in a
> > single
> > >> index unless you're sharding.  In addition to this, our indexes
> wouldn't
> > >> work well using the distributed search functionality in Solr because
> the
> > >> documents are not evenly or randomly distributed.  We are currently
> > using
> > >> Lucene's MultiSearcher to search over subsets of these indexes.
> > >>
> > >> I know this has been brought up a number of times in previous posts
> and
> > the
> > >> typical response is that the best thing to do is to convert everything
> > into
> > >> a single index.  One of the major reasons for having the indexes split
> > up
> > >> the way we do is because different types of data need to be indexed at
> > >> different intervals.  You may need one index to be updated every 20
> > minutes
> > >> and another is only updated every week.  If we move to a single index,
> > then
> > >> we will constantly be warming and replacing searchers for the entire
> > >> dataset, and will essentially render the searcher caches useless.  If
> we
> > >> were able to have multiple indexes, they would each have a searcher
> and
> > >> updates would be isolated to a subset of the data.
> > >>
> > >> The other problem is that we will likely need to shard this large
> single
> > >> index and there isn't a clean way to shard randomly and evenly across
> > >> all of
> > >> the data.  We would, however, like to shard a single data type.  If we
> > could
> > >> use multiple indexes, we would likely be also sharding a small sub-set
> > of
> > >> them.
> > >>
> > >> Thanks in advance,
> > >>
> > >> Ben
> > >>
> >
>

Re: Multiple indexes inside a single core

Posted by Erick Erickson <er...@gmail.com>.
Ah, I should have read more carefully...

I remember this being discussed on the dev list, and I thought there might
be
a Jira attached but I sure can't find it.

If you're willing to work on it, you might hop over to the solr dev list and
start
a discussion, maybe ask for a place to start. I'm sure some of the devs have
thought about this...

If nobody on the dev list says "There's already a JIRA on it", then you
should
open one. The Jira issues are generally preferred when you start getting
into
design because the comments are preserved for the next person who tries
the idea or makes changes, etc....

Best
Erick

On Wed, Oct 20, 2010 at 9:52 PM, Ben Boggess <be...@gmail.com> wrote:

> Thanks Erick.  The problem with multiple cores is that the documents are
> scored independently in each core.  I would like to be able to search across
> both cores and have the scores 'normalized' in a way that's similar to what
> Lucene's MultiSearcher would do.  As far as I understand, multiple cores
> would likely result in seriously skewed scores in my case since the
> documents are not distributed evenly or randomly.  I could have one
> core/index with 20 million docs and another with 200.
>
> I've poked around in the code and this feature doesn't seem to exist.  I
> would be happy with finding a decent place to try to add it.  I'm not sure
> if there is a clean place for it.
>
> Ben
>
> On Oct 20, 2010, at 8:36 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > It seems to me that multiple cores are along the lines you
> > need, a single instance of Solr that can search across multiple
> > sub-indexes that do not necessarily share schemas, and are
> > independently maintainable......
> >
> > This might be a good place to start:
> http://wiki.apache.org/solr/CoreAdmin
> >
> > HTH
> > Erick
> >
> > On Wed, Oct 20, 2010 at 3:23 PM, ben boggess <be...@gmail.com>
> wrote:
> >
> >> We are trying to convert a Lucene-based search solution to a
> >> Solr/Lucene-based solution.  The problem we have is that we currently
> have
> >> our data split into many indexes and Solr expects things to be in a
> single
> >> index unless you're sharding.  In addition to this, our indexes wouldn't
> >> work well using the distributed search functionality in Solr because the
> >> documents are not evenly or randomly distributed.  We are currently
> using
> >> Lucene's MultiSearcher to search over subsets of these indexes.
> >>
> >> I know this has been brought up a number of times in previous posts and
> the
> >> typical response is that the best thing to do is to convert everything
> into
> >> a single index.  One of the major reasons for having the indexes split
> up
> >> the way we do is because different types of data need to be indexed at
> >> different intervals.  You may need one index to be updated every 20
> minutes
> >> and another is only updated every week.  If we move to a single index,
> then
> >> we will constantly be warming and replacing searchers for the entire
> >> dataset, and will essentially render the searcher caches useless.  If we
> >> were able to have multiple indexes, they would each have a searcher and
> >> updates would be isolated to a subset of the data.
> >>
> >> The other problem is that we will likely need to shard this large single
> >> index and there isn't a clean way to shard randomly and evenly across
> >> all of
> >> the data.  We would, however, like to shard a single data type.  If we
> could
> >> use multiple indexes, we would likely be also sharding a small sub-set
> of
> >> them.
> >>
> >> Thanks in advance,
> >>
> >> Ben
> >>
>

Re: Multiple indexes inside a single core

Posted by Ben Boggess <be...@gmail.com>.
Thanks Erick.  The problem with multiple cores is that the documents are scored independently in each core.  I would like to be able to search across both cores and have the scores 'normalized' in a way that's similar to what Lucene's MultiSearcher would do.  As far as I understand, multiple cores would likely result in seriously skewed scores in my case since the documents are not distributed evenly or randomly.  I could have one core/index with 20 million docs and another with 200.

I've poked around in the code and this feature doesn't seem to exist.  I would be happy with finding a decent place to try to add it.  I'm not sure if there is a clean place for it.

Ben

On Oct 20, 2010, at 8:36 PM, Erick Erickson <er...@gmail.com> wrote:

> It seems to me that multiple cores are along the lines you
> need, a single instance of Solr that can search across multiple
> sub-indexes that do not necessarily share schemas, and are
> independently maintainable......
> 
> This might be a good place to start: http://wiki.apache.org/solr/CoreAdmin
> 
> HTH
> Erick
> 
> On Wed, Oct 20, 2010 at 3:23 PM, ben boggess <be...@gmail.com> wrote:
> 
>> We are trying to convert a Lucene-based search solution to a
>> Solr/Lucene-based solution.  The problem we have is that we currently have
>> our data split into many indexes and Solr expects things to be in a single
>> index unless you're sharding.  In addition to this, our indexes wouldn't
>> work well using the distributed search functionality in Solr because the
>> documents are not evenly or randomly distributed.  We are currently using
>> Lucene's MultiSearcher to search over subsets of these indexes.
>> 
>> I know this has been brought up a number of times in previous posts and the
>> typical response is that the best thing to do is to convert everything into
>> a single index.  One of the major reasons for having the indexes split up
>> the way we do is because different types of data need to be indexed at
>> different intervals.  You may need one index to be updated every 20 minutes
>> and another is only updated every week.  If we move to a single index, then
>> we will constantly be warming and replacing searchers for the entire
>> dataset, and will essentially render the searcher caches useless.  If we
>> were able to have multiple indexes, they would each have a searcher and
>> updates would be isolated to a subset of the data.
>> 
>> The other problem is that we will likely need to shard this large single
>> index and there isn't a clean way to shard randomly and evenly across all
>> of
>> the data.  We would, however, like to shard a single data type.  If we could
>> use multiple indexes, we would likely be also sharding a small sub-set of
>> them.
>> 
>> Thanks in advance,
>> 
>> Ben
>> 

Re: Multiple indexes inside a single core

Posted by Erick Erickson <er...@gmail.com>.
It seems to me that multiple cores are along the lines you
need, a single instance of Solr that can search across multiple
sub-indexes that do not necessarily share schemas, and are
independently maintainable......

This might be a good place to start: http://wiki.apache.org/solr/CoreAdmin
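
A minimal solr.xml along those lines might look like this (just a sketch,
the core names are made up):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="fast-changing" instanceDir="fast-changing" />
    <core name="slow-changing" instanceDir="slow-changing" />
  </cores>
</solr>

Each core gets its own searcher and caches, so committing to one core
doesn't invalidate the searchers of the others.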

HTH
Erick

On Wed, Oct 20, 2010 at 3:23 PM, ben boggess <be...@gmail.com> wrote:

> We are trying to convert a Lucene-based search solution to a
> Solr/Lucene-based solution.  The problem we have is that we currently have
> our data split into many indexes and Solr expects things to be in a single
> index unless you're sharding.  In addition to this, our indexes wouldn't
> work well using the distributed search functionality in Solr because the
> documents are not evenly or randomly distributed.  We are currently using
> Lucene's MultiSearcher to search over subsets of these indexes.
>
> I know this has been brought up a number of times in previous posts and the
> typical response is that the best thing to do is to convert everything into
> a single index.  One of the major reasons for having the indexes split up
> the way we do is because different types of data need to be indexed at
> different intervals.  You may need one index to be updated every 20 minutes
> and another is only updated every week.  If we move to a single index, then
> we will constantly be warming and replacing searchers for the entire
> dataset, and will essentially render the searcher caches useless.  If we
> were able to have multiple indexes, they would each have a searcher and
> updates would be isolated to a subset of the data.
>
> The other problem is that we will likely need to shard this large single
> index and there isn't a clean way to shard randomly and evenly across all
> of
> the data.  We would, however, like to shard a single data type.  If we could
> use multiple indexes, we would likely be also sharding a small sub-set of
> them.
>
> Thanks in advance,
>
> Ben
>

RE: IndexableBinaryStringTools (was FieldCache)

Posted by Steven A Rowe <sa...@syr.edu>.
On 11/13/2010 at 2:04 PM, Yonik Seeley wrote:
> On Sat, Nov 13, 2010 at 1:50 PM, Steven A Rowe <sa...@syr.edu> wrote:
> > Looks to me like the returned value is in a Solr-internal form of XML
> > character escaping: \u0000 is represented as "#0;" and \u0008 is
> > represented as "#8;".  (The escaping code is in
> > solr/src/java/org/apache/solr/common/util/XML.java.)
> 
> Yep, there is no legal way to represent some unicode code points in XML.

Right - the real fix here (as you pointed out on #lucene) is to not use XML transports.
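
With SolrJ that would be something like this (a sketch, assuming the
javabin request writer and response parser):

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
server.setRequestWriter(new BinaryRequestWriter()); // send updates as javabin, not XML
server.setParser(new BinaryResponseParser());       // parse javabin responses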

> > You can get the value back in its original binary form by unescaping the
> > /#[0-9]+;/ format.  Here is a test illustrating this fix that I added to
> > SolrExampleTests, then ran from SolrExampleEmbeddedTest:
> 
> The problem here is that one might then unescape what was meant to be
> a literal "#8;"
> One could come up with a full escaping mechanism over XML I suppose...
> but I'm not sure it would be worth it.

s/illustrating this fix/exposing this dirty hack/ :)

Steve


Re: IndexableBinaryStringTools (was FieldCache)

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sat, Nov 13, 2010 at 1:50 PM, Steven A Rowe <sa...@syr.edu> wrote:
> Looks to me like the returned value is in a Solr-internal form of XML character escaping: \u0000 is represented as "#0;" and \u0008 is represented as "#8;".  (The escaping code is in solr/src/java/org/apache/solr/common/util/XML.java.)

Yep, there is no legal way to represent some unicode code points in XML.

> You can get the value back in its original binary form by unescaping the /#[0-9]+;/ format.  Here is a test illustrating this fix that I added to SolrExampleTests, then ran from SolrExampleEmbeddedTest:

The problem here is that one might then unescape what was meant to be
a literal "#8;"
One could come up with a full escaping mechanism over XML I suppose...
but I'm not sure it would be worth it.

-Yonik
http://www.lucidimagination.com

RE: IndexableBinaryStringTools (was FieldCache)

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mathias,

> > > I assume that the char[] returned from
> > > IndexableBinaryStringTools.encode is encoded in UTF-8 again
> > > and then stored. At some point the information is lost and
> > > cannot be recovered.
> >
> > Can you give an example?  This should not happen.
> 
> My character array returned by IndexableBinaryStringTools.encode looks
> like following:
> 
> char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};
[...]
> BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.
> 
> Why has the string representation changed? From the changed string I
> cannot decode the correct ID.

Looks to me like the returned value is in a Solr-internal form of XML character escaping: \u0000 is represented as "#0;" and \u0008 is represented as "#8;".  (The escaping code is in solr/src/java/org/apache/solr/common/util/XML.java.)

You can get the value back in its original binary form by unescaping the /#[0-9]+;/ format.  Here is a test illustrating this fix that I added to SolrExampleTests, then ran from SolrExampleEmbeddedTest:

==============
  @Test
  public void testIndexableBinary() throws Exception {
    // Empty the database...
    server.deleteByQuery( "*:*" );// delete everything!
    server.commit();
    assertNumFound( "*:*", 0 ); // make sure it got in
 
    byte[] binary = new byte[] 
      { (byte)0, (byte)0, (byte)0x84, (byte)0xF0, (byte)0x6A, (byte)0, 
        (byte)4, (byte)0, (byte)0,    (byte)0,    (byte)2,    (byte)0 };
    int encodedLen = IndexableBinaryStringTools.getEncodedLength
      (binary, 0, binary.length);
    char encoded[] = new char[encodedLen];
    IndexableBinaryStringTools.encode
      (binary, 0, binary.length, encoded, 0, encoded.length);
    final String encodedString = new String(encoded);
    log.info("Encoded: " + stringToIntSequence(encodedString));
    // Expected encoded: {         0, 8508, 3392,   64,    0,    8,    0,    0 }
    String expectedEncoded = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
    assertEquals(stringToIntSequence(expectedEncoded),
                 stringToIntSequence(encodedString));
      
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", encodedString);
    server.add(doc);
    server.commit();
        
    SolrQuery query = new SolrQuery();
    query.setQuery("*:*");
    QueryResponse rsp = server.query(query);
    SolrDocument retrievedDoc = rsp.getResults().get(0);
    String retrievedEncoded = (String)retrievedDoc.getFieldValue("id");
    String unescapedRetrievedEncoded = unescapeSolrXMLEscaping(retrievedEncoded);
    assertEquals(stringToIntSequence(encodedString), 
                 stringToIntSequence(unescapedRetrievedEncoded));
  }
    
  String stringToIntSequence(String str) {
    StringBuilder builder = new StringBuilder();
    for (int chnum = 0 ; chnum < str.length() ; ++chnum) {
      if (chnum > 0) {
        builder.append(", ");
      }
      builder.append((int)str.charAt(chnum))
        .append(" (").append(str.charAt(chnum)).append(")");
    }
    return builder.toString();
  }
  String unescapeSolrXMLEscaping(String escaped) {
    StringBuffer unescaped = new StringBuffer();
    Matcher matcher = Pattern.compile("#(\\d+);").matcher(escaped);
    while (matcher.find()) {
      String replacement = String.format
        ("%c",(char)Integer.parseInt(matcher.group(1)));
      matcher.appendReplacement(unescaped, replacement); 
    }
    matcher.appendTail(unescaped);
    return unescaped.toString();
  }
==============

Steve


IndexableBinaryStringTools (was FieldCache)

Posted by Mathias Walter <ma...@gmx.net>.
Hi,

> > [...] I tried to use IndexableBinaryStringTools to re-encode my 11 byte
> > array. The size was increased to 7 characters (= 14 bytes)
> > which is still a gain of more than 50 percent compared to the UTF8
> > encoding. BTW: I found no sample of how to use the
> > IndexableBinaryStringTools class except in the unit tests.
> 
> IndexableBinaryStringTools will eventually be deprecated and then dropped, in favor of native
> indexable/searchable binary terms.  More work is required before these are possible, though.
> 
> Well-maintained unit tests are not a bad way to describe functionality...

Sure, but there is no unit test for Solr.

> > I assume that the char[] returned from IndexableBinaryStringTools.encode
> > is encoded in UTF-8 again and then stored. At some point
> > the information is lost and cannot be recovered.
> 
> Can you give an example?  This should not happen.

It's hard to give an example output, because the binary string representation contains unprintable characters. I'll try to explain what I'm doing.

My character array returned by IndexableBinaryStringTools.encode looks like following:

char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};

Then I add it to a SolrInputDocument:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", new String(encoded));

If I now print the SolrInputDocument using System.out.println(doc), the String representation of the character array is correct.

Then I add it to a RAMDirectory:

ArrayList<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc);
solrServer.add(docs);
solrServer.commit();

... and immediately retrieve it like follows:

SolrQuery query = new SolrQuery();
query.setQuery("*:*");
QueryResponse rsp = solrServer.query(query);
SolrDocumentList docList = rsp.getResults();
System.out.println(docList);

Now the string representation of the SolrDocument's ID looks different from that of the SolrInputDocument.

If I do not create a new string in doc.addField, just the string representation of the array address will be added to the SolrInputDocument.

BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.

Why has the string representation changed? From the changed string I cannot decode the correct ID.
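
For completeness, the decode step mirrors the encode calls (a sketch;
retrieved stands in for the char[] of the ID string that comes back):

int decodedLen = IndexableBinaryStringTools.getDecodedLength(retrieved, 0, retrieved.length);
byte[] decoded = new byte[decodedLen];
IndexableBinaryStringTools.decode(retrieved, 0, retrieved.length, decoded, 0, decodedLen);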

--
Kind regards,
Mathias


RE: FieldCache

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mathias,

> [...] I tried to use IndexableBinaryStringTools to re-encode my 11 byte
> array. The size was increased to 7 characters (= 14 bytes)
> which is still a gain of more than 50 percent compared to the UTF8
> encoding. BTW: I found no sample of how to use the
> IndexableBinaryStringTools class except in the unit tests.

IndexableBinaryStringTools will eventually be deprecated and then dropped, in favor of native indexable/searchable binary terms.  More work is required before these are possible, though.

Well-maintained unit tests are not a bad way to describe functionality...
 
> I assume that the char[] returned from IndexableBinaryStringTools.encode
> is encoded in UTF-8 again and then stored. At some point
> the information is lost and cannot be recovered.

Can you give an example?  This should not happen.

Steve


Re: AW: FieldCache

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2010-10-25 at 09:41 +0200, Mathias Walter wrote:
> [...] I enabled the field cache for my ID field and another
> single char field (PAS type) to get the benefit of accessing
> the fields with an array. Unfortunately, the IDs are too
> large to fit in memory. I gave 12 GB of RAM to each node and
> also tried to use the MMapDirectory and/or CompressedOops.
> Lucene always runs out of memory.

That is a known problem with Lucene 3.x and earlier. The cache uses Strings
for the terms, which has a lot of overhead. As you discovered, reducing the
length of the IDs does not help much.

[Encoding ID as 11 stored bytes]

> Recently I upgraded to trunk (4.0) and tried to use the BytesRefs
> from FieldCache.DEFAULT.getTerms directly. But the bytes are
> encoded in an unknown form (unknown to me) and cannot be decoded
> with IndexableBinaryStringTools.decode.

It depends on what you put into it, but if you represent your IDs as
normal Strings at index time, they will be stored in UTF-8 encoding.
Since you're using 11 ASCII characters for an ID, this means 11 bytes.
You can get your Strings back by calling myBytesRef.utf8ToString().

The overhead for BytesRefs is a lot lower than for Strings, so simply
indexing your IDs and using the field cache might solve your problem
when you're using trunk.
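
A sketch of that pattern (assuming the trunk field cache API):

FieldCache.DocTerms ids = FieldCache.DEFAULT.getTerms(reader, "id"); // field must be indexed
BytesRef ref = new BytesRef();
ids.getTerm(docID, ref);
String id = ref.utf8ToString(); // 11 ASCII characters -> 11 bytes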

- Toke


AW: FieldCache

Posted by Mathias Walter <ma...@gmx.net>.
I don't think it is an XY problem.

I indexed about 90 million sentences and the PAS (predicate argument structures) they consist of (which are about 500 million). Then
I try to do NER (named entity recognition) by searching about 5 million entities. For each entity I need all the search results, not
just the top X. Since about 10 percent of the entities are highly frequent (i.e. there are more than 5 million hits for "human"), it
takes very long to obtain the data from the index. "Very long" means about a day with 15 distributed Katta nodes. Katta is just a
distribution and shard balancing solution on top of Lucene.

Initially, I tried distributed search with Solr. But it was too slow to retrieve a large set of documents. Then I switched to Lucene
and made some improvements. I enabled the field cache for my ID field and another single char field (PAS type) to get the benefit of
accessing the fields with an array. Unfortunately, the IDs are too large to fit in memory. I gave 12 GB of RAM to each node and also
tried to use the MMapDirectory and/or CompressedOops. Lucene always runs out of memory.

Then I investigated the storage of the fields. String fields are stored in UTF-8 encoding. But my ID will never contain multi-byte UTF-8
characters. It follows a numeric schema but does not fit into a single long. I encoded it into a byte array of 11 bytes (compared
to 30 bytes of UTF-8 encoding). Then I changed the field description in schema.xml to binary. I still use the EmbeddedSolrServer to
create the indices.
Also, I had to remove the uniqueKey node because binary fields cannot be indexed, and being indexed is a requirement for the unique key.
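
The relevant schema.xml entries now look roughly like this (a sketch of my
setup; the type name is what I chose):

<fieldType name="binary" class="solr.BinaryField"/>
<field name="id" type="binary" indexed="false" stored="true"/>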

After reindexing I discovered that nonindexed or binary fields cannot be used with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte array. The size was increased to 7 characters (= 14 bytes)
which is still a gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample of how to use the
IndexableBinaryStringTools class except in the unit tests.

Unfortunately, I was not able to use it with the EmbeddedSolrServer and the Lucene client. The search result never looked identical
compared to the IDs used to create the SolrInputDocument.

I assume that the char[] returned from IndexableBinaryStringTools.encode is encoded in UTF-8 again and then stored. At some point
the information is lost and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from FieldCache.DEFAULT.getTerms directly. But the bytes are
encoded in an unknown form (unknown to me) and cannot be decoded with IndexableBinaryStringTools.decode.

The question is now: how do I increase the performance of binary field retrieval without exhausting memory?

I also read some comments which suggest using payloads. But I never tried this approach. Also, the column-stride fields approach
(LUCENE-2186) looks promising but is not released yet.

BTW: I made some tests with a smaller index and the ID encoded as string. Using the field cache improves the hit retrieval
dramatically (from 18 seconds down to 2 seconds per query, with a large number of results).

--
Kind regards,
Mathias

> -----Ursprüngliche Nachricht-----
> Von: Erick Erickson [mailto:erickerickson@gmail.com]
> Gesendet: Samstag, 23. Oktober 2010 21:40
> An: solr-user@lucene.apache.org
> Betreff: Re: FieldCache
> 
> Why do you want to? Basically, the caches are there to improve
> #searching#. To search something, you must index it. Retrieving
> it is usually a rare enough operation that caching is irrelevant.
> 
> This smells like an XY problem, see:
> http://people.apache.org/~hossman/#xyproblem
> 
> If this seems like gibberish, could you explain your problem
> a little more?
> 
> Best
> Erick
> 
On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter <ma...@gmx.net> wrote:
> 
> > Hi,
> >
> > does a field which should be cached need to be indexed?
> >
> > I have a binary field which is just stored. Retrieving it via
> > FieldCache.DEFAULT.getTerms returns empty BytesRefs.
> >
> > Then I found the following post:
> > http://www.mail-archive.com/dev@lucene.apache.org/msg05403.html
> >
> > How can I use the FieldCache with a binary field?
> >
> > --
> > Kind regards,
> > Mathias
> >
> >


Re: command line to check if Solr is up running

Posted by Rob Casson <ro...@gmail.com>.
you could look at the ping stuff:

     http://wiki.apache.org/solr/SolrConfigXml#The_Admin.2BAC8-GUI_Section

cheers,
rob

On Mon, Oct 25, 2010 at 3:56 PM, Xin Li <xl...@book.com> wrote:
> As we know we can use a browser to check if Solr is running by going to http://$hostName:$portNumber/$masterName/admin, say http://localhost:8080/solr1/admin. My question is: are there any ways to check it using the command line? I used "curl http://localhost:8080" to check my Tomcat, and it worked fine. However, there is no response if I try "curl http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone know any command line alternatives?
>
> Thanks,
> Xin

Re: command line to check if Solr is up running

Posted by Pradeep Singh <pk...@gmail.com>.
How about - Please do not respond to 20 emails at one time?

On Wed, Oct 27, 2010 at 12:33 AM, Lance Norskog <go...@gmail.com> wrote:

> Please start new threads for new topics.
>
>
> Xin Li wrote:
>
>> As we know we can use browser to check if Solr is running by going to
>> http://$hostName:$portNumber/$masterName/admin, say
>> http://localhost:8080/solr1/admin. My question is: are there any ways to
>> check it using the command line? I used "curl http://localhost:8080" to check
>> my Tomcat, it worked fine. However, no response if I try "curl
>> http://localhost:8080/solr1/admin" (even when my Solr is running). Does
>> anyone know any command line alternatives?
>>
>> Thanks,
>> Xin
>

RE: command line to check if Solr is up running

Posted by Xin Li <xl...@book.com>.
Thanks Bob and Ahmet, 

"curl http://localhost:8080/solr1/admin/ping" works fine :)

Xin



-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com] 
Sent: Monday, October 25, 2010 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: command line to check if Solr is up running

> My question is: are
> there any ways to check it using the command line? I used "curl
> http://localhost:8080" to check my Tomcat, it worked
> fine. However, no response if I try "curl
http://localhost:8080/solr1/admin" (even when my Solr
> is running). Does anyone know any command line
> alternatives?


What about curl solr/admin/ping?echoParams=none&omitHeader=on



      


Re: command line to check if Solr is up running

Posted by Lance Norskog <go...@gmail.com>.
Please start new threads for new topics.

Xin Li wrote:
> As we know we can use a browser to check if Solr is running by going to http://$hostName:$portNumber/$masterName/admin, say http://localhost:8080/solr1/admin. My question is: are there any ways to check it using the command line? I used "curl http://localhost:8080" to check my Tomcat, and it worked fine. However, there is no response if I try "curl http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone know any command line alternatives?
>
> Thanks,
> Xin

Re: command line to check if Solr is up running

Posted by Peter Karich <pe...@yahoo.de>.
  Hi Xin,

from the wiki:
http://wiki.apache.org/solr/SolrConfigXml

* The URL of the "ping" query is /admin/ping

* You can also check (via wget) the number of documents. It might look
like a rusty hack but it works for me:

wget -T 1 -q "http://localhost:8080/solr/select?q=*:*" -O - | tr '/>'
'\n' | grep numFound | tr '"' ' ' | awk '{print $5}'

Regards,
Peter.

> As we know we can use a browser to check if Solr is running by going to http://$hostName:$portNumber/$masterName/admin, say http://localhost:8080/solr1/admin. My question is: are there any ways to check it using the command line? I used "curl http://localhost:8080" to check my Tomcat, and it worked fine. However, there is no response if I try "curl http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone know any command line alternatives?
>
> Thanks,
> Xin


-- 
http://jetwick.com twitter search prototype


Re: command line to check if Solr is up running

Posted by Ahmet Arslan <io...@yahoo.com>.
> My question is: are
> there any ways to check it using the command line? I used "curl
> http://localhost:8080" to check my Tomcat, it worked
> fine. However, no response if I try "curl http://localhost:8080/solr1/admin" (even when my Solr
> is running). Does anyone know any command line
> alternatives?


What about curl solr/admin/ping?echoParams=none&omitHeader=on



      

command line to check if Solr is up running

Posted by Xin Li <xl...@book.com>.
As we know we can use a browser to check if Solr is running by going to http://$hostName:$portNumber/$masterName/admin, say http://localhost:8080/solr1/admin. My question is: are there any ways to check it using the command line? I used "curl http://localhost:8080" to check my Tomcat, and it worked fine. However, there is no response if I try "curl http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone know any command line alternatives?

Thanks,
Xin

RE: FieldCache

Posted by Mathias Walter <ma...@gmx.net>.
Hi,

> On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter <ma...@gmx.net>
> wrote:
> > I indexed about 90 million sentences and the PAS (predicate argument
> structures) they consist of (which are about 500 million). Then
> > I try to do NER (named entity recognition) by searching about 5 million
> > entities. For each entity I need all the search results, not
> > just the top X. Since about 10 percent of the entities are highly frequent
> > (i.e. there are more than 5 million hits for "human"), it
> > takes very long to obtain the data from the index. "Very long" means about
> > a day with 15 distributed Katta nodes. Katta is just a
> > distribution and shard balancing solution on top of Lucene.
> 
> if you aren't getting top-N results/doing search, are you sure a
> search engine library/server is the right tool for this job?

No, I'm not sure, but I didn't find another solution. Any other solution also has to create some kind of index and has to provide some search API. Because I need SpanNearQuery and PhraseQuery to find some multi-term entities, I think Solr/Lucene is a good starting point. Also, I need the classic top-N results for the web application. So a single solution is preferred.
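
For a multi-term entity, the query I build looks roughly like this (the field
name and terms are made up):

SpanQuery[] words = new SpanQuery[] {
    new SpanTermQuery(new Term("sentence", "tumor")),
    new SpanTermQuery(new Term("sentence", "necrosis")),
    new SpanTermQuery(new Term("sentence", "factor"))
};
SpanNearQuery entity = new SpanNearQuery(words, 0, true); // adjacent, in order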

> > Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte
> > array. The size was increased to 7 characters (= 14 bytes)
> > which is still a gain of more than 50 percent compared to the UTF-8 encoding.
> > BTW: I found no sample of how to use the
> > IndexableBinaryStringTools class except in the unit tests.
> 
> it is deprecated in trunk, because you can index binary terms (your
> own byte[]) directly if you want. To do this, you need to use a custom
> AttributeFactory.

How do I use it with Solr, i.e. how do I set up a schema.xml using a custom AttributeFactory?

--
Kind regards,
Mathias


Re: FieldCache

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Oct 25, 2010 at 3:41 PM, Mathias Walter <ma...@gmx.net> wrote:

> How do I use it with Solr, i.e. how do I set up a schema.xml using a custom AttributeFactory?
>

at the moment there is no way to specify an AttributeFactory
(AttributeFactoryFactory? heh) in the schema.xml, nor do the
TokenizerFactories have any way to use any but the default.

So, in order to do this at the moment, you need to make a custom
TokenizerFactory hardwired to your AttributeFactory... take a look at
KeywordTokenizerFactory, you could make MyKeywordTokenizerFactory that
instead of invoking:

new KeywordTokenizer(input);

in its create() method, would use the
KeywordTokenizer(AttributeFactory, Reader, int) ctor with your custom
AttributeFactory.
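
I.e. something like this (an untested sketch; MyAttributeFactory stands in
for your custom AttributeFactory):

public class MyKeywordTokenizerFactory extends BaseTokenizerFactory {
  @Override
  public Tokenizer create(Reader input) {
    // same as KeywordTokenizerFactory, but with a custom AttributeFactory
    return new KeywordTokenizer(new MyAttributeFactory(), input,
                                KeywordTokenizer.DEFAULT_BUFFER_SIZE);
  }
}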

Re: FieldCache

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Oct 25, 2010 at 9:00 AM, Steven A Rowe <sa...@syr.edu> wrote:
> It's not actually deprecated yet.

you are right! only in my patch!

> AFAICT, Test2BTerms only deals with the indexing side of this issue, and doesn't test searching.
>
> LUCENE-2551 does, however, test searching.  Why hasn't this been committed yet?  I had just assumed that it was because fully indexable/searchable binary terms were not yet ready for prime time.
>
> I hadn't realized that native binary terms were fully functional - is there any reason why integers (for example) could not be directly indexable/searchable?

they are! Term itself now holds a BytesRef behind the scenes, and
pretty much everything is fully-functional (for example, the collated
sort use case works with the patch in LUCENE-2551)

But, the short answer is we still need to fix TermRangeQuery to just
work on bytes.
The problem is I didn't link the dependent issue: LUCENE-2514 (I just did this).

There is a patch to fix all the range query stuff there... it's not
finished but not far off. The basic idea is to make using
[ICU]CollationAnalyzer the supported way of doing this, including
queryparser support, etc.

The long answer is even after LUCENE-2514 is resolved, there are still
some things to figure out: for example how should we properly expose
stuff like this in Solr? Do we really need to modify the
TokenizerFactories to take AttributeFactory and add
"AttributeFactoryFactory"?

Or is it better to add a Solr fieldtype for these kind of things, and
do it that way? Or we could just add a special
"CollatedKeywordTokenizerFactory" with the current model that supports
the sorting use case easily, but we still want range query support I
think...

RE: FieldCache

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Robert,

On 10/25/2010 at 8:20 AM, Robert Muir wrote:
> it is deprecated in trunk, because you can index binary terms (your
> own byte[]) directly if you want. To do this, you need to use a custom
> AttributeFactory.

It's not actually deprecated yet.

> See src/test/org/apache/lucene/index/Test2BTerms or
> https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how
> to do this.

AFAICT, Test2BTerms only deals with the indexing side of this issue, and doesn't test searching.

LUCENE-2551 does, however, test searching.  Why hasn't this been committed yet?  I had just assumed that it was because fully indexable/searchable binary terms were not yet ready for prime time.

I hadn't realized that native binary terms were fully functional - is there any reason why integers (for example) could not be directly indexable/searchable?

Steve


Re: FieldCache

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter <ma...@gmx.net> wrote:
> I indexed about 90 million sentences and the PAS (predicate argument structures) they consist of (which are about 500 million). Then
> I try to do NER (named entity recognition) by searching about 5 million entities. For each entity I need all the search results, not
> just the top X. Since about 10 percent of the entities are highly frequent (i.e. there are more than 5 million hits for "human"), it
> takes very long to obtain the data from the index. "Very long" means about a day with 15 distributed Katta nodes. Katta is just a
> distribution and shard balancing solution on top of Lucene.

if you aren't getting top-N results/doing search, are you sure a
search engine library/server is the right tool for this job?

> Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte array. The size was increased to 7 characters (= 14 bytes)
> which is still a gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample of how to use the
> IndexableBinaryStringTools class except in the unit tests.

it is deprecated in trunk, because you can index binary terms (your
own byte[]) directly if you want. To do this, you need to use a custom
AttributeFactory.

See src/test/org/apache/lucene/index/Test2BTerms or
https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how
to do this.

Re: FieldCache

Posted by Erick Erickson <er...@gmail.com>.
Why do you want to? Basically, the caches are there to improve
#searching#. To search something, you must index it. Retrieving
it is usually a rare enough operation that caching is irrelevant.

This smells like an XY problem, see:
http://people.apache.org/~hossman/#xyproblem

If this seems like gibberish, could you explain your problem
a little more?

Best
Erick

On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter <ma...@gmx.net> wrote:

> Hi,
>
> does a field which should be cached need to be indexed?
>
> I have a binary field which is just stored. Retrieving it via
> FieldCache.DEFAULT.getTerms returns empty BytesRefs.
>
> Then I found the following post:
> http://www.mail-archive.com/dev@lucene.apache.org/msg05403.html
>
> How can I use the FieldCache with a binary field?
>
> --
> Kind regards,
> Mathias
>
>