Posted to solr-user@lucene.apache.org by Luis Lebolo <lu...@gmail.com> on 2014/02/05 18:32:33 UTC

Problem querying large StrField?

Hi All,

It seems that I can't query on a StrField with a large value (say 70k
characters). I have a Solr document with a string type:

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

and field:

   <dynamicField name="someFieldName_*" type="string" indexed="true"
stored="true" />

Note that it's stored, in case that matters.

Across my documents, the length of the value in this StrField can reach ~70k
characters or more.

The query I'm trying is 'someFieldName_1:*'. If someFieldName_1 has values
with length < ~10k characters, then it works fine and I retrieve various
documents with values in that field.

However, if I query 'someFieldName_2:*' and someFieldName_2 has values with
length ~60k, I don't get back any documents. Even though I *know* that many
documents have a value in someFieldName_2.

If I query *:* and add someFieldName_2 in the field list, I am able to see
the (large) value in someFieldName_2.

So is there some type of limit to the length of strings in StrField that I
can query against?

Thanks,
Luis

Re: Problem querying large StrField?

Posted by Jack Krupansky <ja...@basetechnology.com>.
What does each document represent? What concept is holding all these 
entities together?

The standard approach to true many-to-many relationships in Solr is to
denormalize - each document would represent one relationship and have an ID
field that links the relationship to whatever each of your current Solr
documents represents.
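
As a rough sketch of that denormalized shape (illustration only, with
hypothetical field names), each (a_i, b_j) pair from the quoted message below
that actually has a data value becomes its own small document:

    { "id":"a1_b7",  "a_id_s":"a1", "b_id_s":"b7",  "value_f":0.42 }
    { "id":"a5_b7",  "a_id_s":"a5", "b_id_s":"b7",  "value_f":1.30 }
    { "id":"a2_b12", "a_id_s":"a2", "b_id_s":"b12", "value_f":0.07 }

The "which B's have data for my selected A's" question then becomes a filter
on a_id_s plus a facet (or grouping) on b_id_s, e.g.
q=a_id_s:(a1 OR a2 OR a5 OR a9)&rows=0&facet=true&facet.field=b_id_s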

Multivalued fields, large string fields, and dynamic fields are all powerful 
tools in Lucene/Solr, but only when used in moderation. The way to scale in 
Lucene/Solr is documents and sharding, not massive documents with lots of 
large multivalued/string fields.

That said, given Lucene/Solr's rich support for large tokenized fields, a
large tokenized field might be a better choice for representing large lists
of entities - if denormalization is not quite practical.
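
As a sketch of that alternative (hypothetical type and field names), the
comma-delimited id list could be run through a tokenizer that splits on
commas, so each id is indexed as its own small term instead of one giant
token:

    <fieldType name="id_list" class="solr.TextField">
      <analyzer>
        <!-- split "a1,a3,a6,..." into one term per id at index time -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
      </analyzer>
    </fieldType>

    <dynamicField name="dataAvailability_*" type="id_list" indexed="true" stored="false"/>

A query such as dataAvailability_1:(a1 OR a2 OR a5 OR a9) then matches the
individual ids, each of which stays far below the per-term byte limit.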

-- Jack Krupansky

-----Original Message----- 
From: Luis Lebolo
Sent: Monday, February 10, 2014 12:42 AM
To: solr-user
Subject: Re: Problem querying large StrField?

Hi Yonik,

Thanks for the response. Our use case is perhaps a little unusual. The
actual domain is in bioinformatics, but I'll try to generalize. We have two
types of entities, call them A's and B's. For a given pair of entities
(a_i, b_j) we may or may not have an associated data value z. Standard
many-to-many stuff in a DB. Users can select an arbitrary set of entities from
A. What we'd then like to ask of Solr is: which entities of type B have a
data value for any of the A's I've selected?

The way we've approached this to date is to index the set of B, such that
each document has a multivalued field containing the id's of all entities A
that have a data value. If I select a set of A (a1, a2, a5, a9), then I
would query data availability across B as dataAvailabilityField:(a1 OR a2
OR a5 OR a9).

The sets of A and B are fairly large (~10 - 30k). This was working ok, but
our datasets have increased and now the giant OR is getting too slow.

As an alternative approach, we developed a ValueParser plugin that took
advantage of our ability to sort the list of entity id's and do some clever
things, like binary searches and short circuits on the results. For this to
work, we concatenated all the id's into a single comma delimited value. So
the data availability field is now single valued, but has a term that looks
like "a1,a3,a6,a7....". Our function query then takes the list of A id's
that we're interested in and searches the documents for ones that match any
value. Worked great and quite fast when the id list was short enough. But
then we tried it on the full data set and the indexed terms of id's are
HUGE.

I know it's a bit of an odd use case, but have you seen anything like this
before? Do you have any thoughts on how we might better accomplish this
functionality?

Thanks!


On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

> On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo <lu...@gmail.com> wrote:
> > Update: It seems I get the bad behavior (no documents returned) when the
> > length of a value in the StrField is greater than or equal to 32,767
> > (2^15). Is this some type of bit overflow somewhere?
>
> I believe that's the maximum size of an indexed token.
> Can you share your use-case?  Why are you trying to index such large
> values as a single token?
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr
> 


Re: Problem querying large StrField?

Posted by Yonik Seeley <yo...@heliosearch.com>.
On Mon, Feb 10, 2014 at 12:42 AM, Luis Lebolo <lu...@gmail.com> wrote:
> For this to
> work, we concatenated all the id's into a single comma delimited value.

It doesn't sound like you need the resulting big value to be indexed.
All you need to do is retrieve it relatively quickly and do your own
matching logic on it?
If so, perhaps look at using docvalues.
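
For reference, one possible shape for that (hypothetical names, and only a
sketch -- some docValues types carry their own per-value size limits, so this
would need to be verified for values this large):

    <fieldType name="string_dv" class="solr.StrField" docValues="true"/>
    <dynamicField name="dataAvailability_*" type="string_dv"
                  indexed="false" stored="false" docValues="true"/>

A custom function query / ValueSource could then read the per-document value
from docValues instead of depending on it being a single indexed term.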

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr

Re: Problem querying large StrField?

Posted by Luis Lebolo <lu...@gmail.com>.
Hi Yonik,

Thanks for the response. Our use case is perhaps a little unusual. The
actual domain is in bioinformatics, but I'll try to generalize. We have two
types of entities, call them A's and B's. For a given pair of entities
(a_i, b_j) we may or may not have an associated data value z. Standard
many-to-many stuff in a DB. Users can select an arbitrary set of entities from
A. What we'd then like to ask of Solr is: which entities of type B have a
data value for any of the A's I've selected?

The way we've approached this to date is to index the set of B, such that
each document has a multivalued field containing the id's of all entities A
that have a data value. If I select a set of A (a1, a2, a5, a9), then I
would query data availability across B as dataAvailabilityField:(a1 OR a2
OR a5 OR a9).

The sets of A and B are fairly large (~10 - 30k). This was working ok, but
our datasets have increased and now the giant OR is getting too slow.

As an alternative approach, we developed a ValueParser plugin that took
advantage of our ability to sort the list of entity id's and do some clever
things, like binary searches and short circuits on the results. For this to
work, we concatenated all the id's into a single comma delimited value. So
the data availability field is now single valued, but has a term that looks
like "a1,a3,a6,a7....". Our function query then takes the list of A id's
that we're interested in and searches the documents for ones that match any
value. Worked great and quite fast when the id list was short enough. But
then we tried it on the full data set and the indexed terms of id's are
HUGE.
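
For illustration, a rough sketch (plain Java, not the actual plugin) of the
kind of matching described above, assuming both the per-document id list and
the selected ids are sorted with the same lexicographic ordering:

    import java.util.Arrays;

    public class AvailabilityMatcher {
        // True if any selected id appears in the document's comma-delimited,
        // lexicographically sorted id list; short-circuits on the first hit.
        static boolean matchesAny(String docIdList, String[] selectedIds) {
            String[] docIds = docIdList.split(",");
            for (String id : selectedIds) {
                if (Arrays.binarySearch(docIds, id) >= 0) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(matchesAny("a1,a3,a6,a7", new String[]{"a2", "a6"})); // true
        }
    }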

I know it's a bit of an odd use case, but have you seen anything like this
before? Do you have any thoughts on how we might better accomplish this
functionality?

Thanks!


On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

> On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo <lu...@gmail.com> wrote:
> > Update: It seems I get the bad behavior (no documents returned) when the
> > length of a value in the StrField is greater than or equal to 32,767
> > (2^15). Is this some type of bit overflow somewhere?
>
> I believe that's the maximum size of an indexed token.
> Can you share your use-case?  Why are you trying to index such large
> values as a single token?
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr
>

Re: Problem querying large StrField?

Posted by Yonik Seeley <yo...@heliosearch.com>.
On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo <lu...@gmail.com> wrote:
> Update: It seems I get the bad behavior (no documents returned) when the
> length of a value in the StrField is greater than or equal to 32,767
> (2^15). Is this some type of bit overflow somewhere?

I believe that's the maximum size of an indexed token.
Can you share your use-case?  Why are you trying to index such large
values as a single token?

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr

Re: Problem querying large StrField?

Posted by Chris Hostetter <ho...@fucit.org>.
: Update: It seems I get the bad behavior (no documents returned) when the
: length of a value in the StrField is greater than or equal to 32,767
: (2^15). Is this some type of bit overflow somewhere?

IIRC there is a limit in the lower-level Lucene code on how many bytes a 
single term can be -- but I don't remember off the top of my head where 
that's enforced.

: > However, if I query 'someFieldName_2:*' and someFieldName_2 has values
: > with length ~60k, I don't get back any documents. Even though I *know* that
: > many documents have a value in someFieldName_2.

First off: don't do a query like that.  You are asking for a prefix query 
using an empty prefix -- that's *hugely* inefficient.  If your goal is to 
find all docs that have some value indexed in the field, then add a 
"has_someFieldName_2" boolean field and query for 
has_someFieldName_2:true, or if you really can't change your index, use 
someFieldName_2:[* TO *]

(If the only thing you are querying on is whether that field has some 
values, then you can make someFieldName_2 stored but not indexed and save 
a *ton* of space in your index.)
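
For example (sketching that suggestion with hypothetical names, assuming the
stock "boolean" field type from the example schema):

    <dynamicField name="has_someFieldName_*" type="boolean" indexed="true" stored="false"/>

At index time, set has_someFieldName_2=true whenever someFieldName_2 is
populated; the availability check then becomes has_someFieldName_2:true
instead of an empty-prefix wildcard.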


That said: I'm also surprised by your description of the problem -- 
specifically that having *any* terms over that length causes a prefix 
query like this to not match any docs at all.  I would have expected you 
to get some errors for the large terms when indexing, and then at query 
time it would only match the docs with the shorter values.

What I'm seeing is that the long terms are silently ignored, but the 
prefix query across the field will still match docs with shorter terms.

I'll open a bug to figure out why we aren't generating an error for this 
at index time, but the behavior at query time looks correct...

hossman@frisbee:~$ perl -le 'print "a,aaa"; print "z," . ("Z" x 32767);' |
    curl 'http://localhost:8983/solr/update?header=false&fieldnames=name,long_s&rowid=id&commit=true' \
    -H 'Content-Type: application/csv' --data-binary @-

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">572</int></lst>
</response>

hossman@frisbee:~$ curl 'http://localhost:8983/solr/select?q=*:*&fl=id,name&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":12,
    "params":{
      "fl":"id,name",
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "name":"a",
        "id":"0"},
      {
        "name":"z",
        "id":"1"}]
  }}


hossman@frisbee:~$ curl 'http://localhost:8983/solr/select?q=long_s:*&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":4,
    "params":{
      "indent":"true",
      "q":"long_s:*",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "name":"a",
        "long_s":"aaa",
        "id":"0",
        "_version_":1459225819107819520}]
  }}






-Hoss
http://www.lucidworks.com/

Re: Problem querying large StrField?

Posted by Luis Lebolo <lu...@gmail.com>.
Update: It seems I get the bad behavior (no documents returned) when the
length of a value in the StrField is greater than or equal to 32,767
(2^15). Is this some type of bit overflow somewhere?


On Wed, Feb 5, 2014 at 12:32 PM, Luis Lebolo <lu...@gmail.com> wrote:

> Hi All,
>
> It seems that I can't query on a StrField with a large value (say 70k
> characters). I have a Solr document with a string type:
>
>     <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
>
> and field:
>
>    <dynamicField name="someFieldName_*" type="string" indexed="true"
> stored="true" />
>
> Note that it's stored, in case that matters.
>
> Across my documents, the length of the value in this StrField can be up to
> ~70k characters or more.
>
> The query I'm trying is 'someFieldName_1:*'. If someFieldName_1 has values
> with length < ~10k characters, then it works fine and I retrieve various
> documents with values in that field.
>
> However, if I query 'someFieldName_2:*' and someFieldName_2 has values
> with length ~60k, I don't get back any documents. Even though I *know* that
> many documents have a value in someFieldName_2.
>
> If I query *:* and add someFieldName_2 in the field list, I am able to see
> the (large) value in someFieldName_2.
>
> So is there some type of limit to the length of strings in StrField that I
> can query against?
>
> Thanks,
> Luis
>