Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2014/12/29 17:33:13 UTC
[jira] [Resolved] (SOLR-6888) Decompressing documents on first-pass distributed queries to get docId is inefficient, use indexed values instead?

     [ https://issues.apache.org/jira/browse/SOLR-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson resolved SOLR-6888.
----------------------------------
    Resolution: Duplicate

[~dsmiley] You're absolutely right, this is the same as SOLR-5478. I'll look over 5478 Real Soon Now.

Closing this one, as it's really nothing more than a crude attempt to quantify the problem.

So many JIRAs, so few stick in my head....

> Decompressing documents on first-pass distributed queries to get docId is inefficient, use indexed values instead?
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6888
>                 URL: https://issues.apache.org/jira/browse/SOLR-6888
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 5.0, Trunk
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: SOLR-6888-hacktiming.patch
>
>
> Assigning this to myself to just not lose track of it, but I won't be working on this in the near term; anyone feeling ambitious should feel free to grab it.
> Note, docId used here is whatever is defined for <uniqueKey>...
> Since Solr 4.1, stored fields are compressed and decompressed automatically in 16K blocks, and this is not configurable. So, to get a single stored value, one must decompress at least one entire 16K block.
> For SolrCloud (and distributed processing in general), we make two trips, one to get the doc id and score (or other sort criteria) and one to return the actual data.
> The first pass here requires that we return the top N docIDs and sort criteria, which means that each and every sub-request has to unpack at least one 16K block (and sometimes more) to get just the doc ID. So if we have 20 shards and only want 20 rows, 95% of the decompression cycles will be wasted. Not to mention all the disk reads.
> It seems like we should be able to do better than that. Can we argue that doc ids are 'special' and should be cached somehow? Let's discuss what this would look like. I can think of a couple of approaches:
> 1> Since doc IDs are "special", can we say that for this purpose returning the indexed version is OK? We'd need to return the actual stored value when the full doc was requested, but for the sub-request only what about returning the indexed value instead of the stored one? On the surface I don't see a problem here, but what do I know? Storing these as DocValues seems useful in this case.
> 1a> A variant is to treat numeric docIds specially, since the indexed value and the stored value should be identical. DocValues seem useful here too. But this seems an unnecessary specialization if <1> is implemented well.
> 2> We could cache individual doc IDs, although I'm not sure what use that really is. Would maintaining the cache overwhelm the savings of not decompressing? I really don't like this idea, but am throwing it out there. Populating the cache from stored data up front would essentially mean decompressing every doc, so that seems untenable.
> 3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily initializing it. I'm not particularly a fan of this either; it doesn't seem like a Good Thing. I can see lazy loading being almost, but not quite totally, useless, i.e. a hit ratio near 0, especially since it'd be thrown out on every openSearcher.
> Really, the only one of these that seems viable is <1>/<1a>. The others would all involve decompressing the docs anyway to get the ID, and I suspect that caching would be of very limited usefulness. I guess <1>'s viability hinges on whether, for internal use, the indexed form of DocId is interchangeable with the stored value.
> Or are there other ways to approach this? Or is this not really something to worry about?
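
The 95% figure in the quoted description can be checked with a short calculation. The sketch below is illustrative only and is not Solr code: the function name first_pass_waste and its result keys are made up for this example. It assumes every shard has at least `rows` matching documents, and the byte count is a worst case in which each returned ID sits in a different 16K block.

```python
BLOCK_SIZE_BYTES = 16 * 1024  # block size quoted in the issue description


def first_pass_waste(shards: int, rows: int) -> dict:
    """Estimate first-pass work for a distributed query.

    Assumption: each shard returns its own top `rows` IDs, and the
    coordinator keeps only the global top `rows` of them.
    """
    ids_fetched = shards * rows        # IDs read from stored fields across all shards
    ids_kept = rows                    # IDs that survive the merge
    wasted = ids_fetched - ids_kept    # IDs decompressed and then discarded
    return {
        "ids_fetched": ids_fetched,
        "ids_kept": ids_kept,
        "wasted_fraction": wasted / ids_fetched,
        # Worst case (assumption): every ID lives in a distinct block.
        "worst_case_bytes": ids_fetched * BLOCK_SIZE_BYTES,
    }


result = first_pass_waste(shards=20, rows=20)
print(result)
```

With 20 shards and 20 rows, 400 IDs are fetched and 20 are kept, so 380 of 400 (95%) are discarded. In general the wasted fraction is 1 - 1/shards, independent of the number of rows.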



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)