Posted to user@lucy.apache.org by rohit0908 <ro...@gmail.com> on 2017/04/11 10:08:14 UTC

[lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

Hi,

I am working with a deeply nested directory structure and large logs in Perl,
and I am trying to generate results from the index data:

my $hits = $searcher->hits(
    query      => ['title'],
    num_wanted => -1,
);

while ( my $hit = $hits->next ) {
    # making 28777 calls to Lucy::Search::Hits::next
    # do some work -- already profiled and optimized this code
}

Now, since the hit count is large, the work inside the loop is consuming a
significant amount of time, and I need to improve its performance.

Can this be run in parallel, or is any other optimization possible when the
hits number in the thousands?

I tried implementing Parallel::ForkManager here, but it increased the running
time significantly instead of reducing it.

I'm out of ideas; please help, as I am badly stuck.

Regards
Rohit Singh





--
View this message in context: http://lucene.472066.n3.nabble.com/iterating-through-hits-is-there-a-way-to-improve-performance-or-can-we-run-these-iterations-in-parall-tp4329286.html
Sent from the lucy-user mailing list archive at Nabble.com.

Re: [lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

Posted by Peter Karman <pe...@peknet.com>.
rohit0908 wrote on 4/14/17 7:33 AM:
> Thanks Marvin for your reply and for taking a quick look at this. I will try
> your second option of caching and using a BitCollector. Meanwhile, could you
> please help me with the point below?
>
>>> If you don't need any fields other than `title` and you currently have
>>> other fields which are `stored`, then you could try changing the FieldType
>>> for those other fields so that they are no longer `stored`.  That will
>>> reduce the cost of deserializing a document.
>
> I am querying on title only, and I need just four fields to serve my
> purpose: title, content, url, and urlpath. Is there a way to fetch only
> these fields to reduce the deserializing cost, or do you mean that fields
> which are not necessary should not be stored? Please let me know how to do
> it, thanks!
>

"Storing" a field means you can retrieve the original value from the index 
directly. You can index a field value without storing it, so that you can search 
on the field but not retrieve the original (un-analyzed) value.

See https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Plan/FieldType.pod for 
the flags available when defining a field.
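As a rough sketch only (the field names match the four Rohit mentioned; the
EasyAnalyzer choice and everything else here are assumptions, since we haven't
seen the real Schema), a schema that indexes `content` without storing it
might look like this:

```perl
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Plan::StringType;
use Lucy::Analysis::EasyAnalyzer;

my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );

# Searchable AND retrievable: the original value comes back from the index.
my $title_type = Lucy::Plan::FullTextType->new(
    analyzer => $analyzer,
    stored   => 1,
);

# Searchable but NOT stored: you can match against it, but the original
# text cannot be fetched back, which makes each document cheaper to
# deserialize on $hits->next.
my $content_type = Lucy::Plan::FullTextType->new(
    analyzer => $analyzer,
    stored   => 0,
);

# Stored verbatim for retrieval, not tokenized.
my $url_type = Lucy::Plan::StringType->new( stored => 1 );

$schema->spec_field( name => 'title',   type => $title_type );
$schema->spec_field( name => 'content', type => $content_type );
$schema->spec_field( name => 'url',     type => $url_type );
$schema->spec_field( name => 'urlpath', type => $url_type );
```

The trade-off: a field with `stored => 0` disappears from the hash that
`fetch_doc` and `$hits->next` return, so only flip it off for fields you never
need to display.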

To give you more concrete advice, we'd need to see your indexing code, 
especially how you define your Schema.


-- 
Peter Karman  .  https://karpet.github.io  .  https://keybase.io/peterkarman

Re: [lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

Posted by rohit0908 <ro...@gmail.com>.
Thanks Marvin for your reply and for taking a quick look at this. I will try
your second option of caching and using a BitCollector. Meanwhile, could you
please help me with the point below?

>> If you don't need any fields other than `title` and you currently have
>> other fields which are `stored`, then you could try changing the FieldType
>> for those other fields so that they are no longer `stored`.  That will
>> reduce the cost of deserializing a document.

I am querying on title only, and I need just four fields to serve my
purpose: title, content, url, and urlpath. Is there a way to fetch only
these fields to reduce the deserializing cost, or do you mean that fields
which are not necessary should not be stored? Please let me know how to do
it, thanks!

-----
Regards
Rohit Singh

Re: [lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Apr 11, 2017 at 3:08 AM, rohit0908 <ro...@gmail.com> wrote:

> Now, since the hit count is large, the work inside the loop is consuming a
> significant amount of time, and I need to improve its performance.

Every call to $hits->next requires deserializing an entire document.  It may
be possible, depending on how your application is structured, to reduce or
avoid the cost of deserialization.

If you don't need any fields other than `title` and you currently have other
fields which are `stored`, then you could try changing the FieldType for
those other fields so that they are no longer `stored`.  That will reduce
the cost of deserializing a document.

Another possibility might be to spend memory to avoid i/o, and cache all the
titles in a Perl array on Searcher initialization with indices corresponding
to Lucy doc IDs.  Then you could use a BitCollector, avoiding the
deserialization that $hits->next does.  Something like this:

    my $searcher = Lucy::Search::IndexSearcher->open(index => $index);
    my @titles;
    my $doc_max = $searcher->doc_max;   # highest internal doc id
    for my $doc_id (1 .. $doc_max) {
        my $doc = $searcher->fetch_doc($doc_id);
        $titles[$doc_id] = $doc->{title};
    }

    my $bit_vec = Lucy::Object::BitVector->new(
        capacity => $searcher->doc_max + 1,
    );
    my $bit_collector = Lucy::Search::Collector::BitCollector->new(
        bit_vector => $bit_vec,
    );
    $searcher->collect(
        collector => $bit_collector,
        query     => $query,
    );
    my $doc_id = 0;
    while (1) {
        # next_hit returns the first set bit at or after the given tick,
        # or -1 when no hits remain, so advance past the last match.
        $doc_id = $bit_vec->next_hit($doc_id + 1);
        last if $doc_id == -1;
        print $titles[$doc_id] . "\n"; # or whatever
    }

> Can this be run in parallel, or is any other optimization possible when the
> hits number in the thousands?

Lucy is single-threaded, and there is not a practical way to parallelize
$hits->next at this time.  I've hacked some process-based parallelism using
unsupported private APIs, but the approach wasn't ready for prime time.
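That said, if the bottleneck is the per-hit *work* rather than $hits->next
itself, coarse-grained process parallelism may still pay off.  The sketch
below is only an assumption, not tested advice: `collect_matching_doc_ids`
and `$index` are hypothetical placeholders, and the key idea is to fork a
few long-lived workers over chunks of doc ids (forking per hit, which
Parallel::ForkManager makes easy to do by accident, pays fork overhead tens
of thousands of times and would explain a slowdown).

```perl
use Parallel::ForkManager;
use Lucy::Search::IndexSearcher;
use POSIX qw(ceil);

# Collect all matching doc ids up front (e.g. via the BitCollector
# approach above), then split the per-hit work across a small, fixed
# number of child processes.
my @doc_ids = @{ collect_matching_doc_ids() };   # hypothetical helper

my $n_workers = 4;
my $chunk     = ceil( @doc_ids / $n_workers );
my $pm        = Parallel::ForkManager->new($n_workers);

for my $w ( 0 .. $n_workers - 1 ) {
    $pm->start and next;    # parent keeps looping; child falls through

    # Each child opens its OWN searcher; Lucy objects should not be
    # shared across a fork.
    my $searcher = Lucy::Search::IndexSearcher->open( index => $index );

    my $lo = $w * $chunk;
    my $hi = $lo + $chunk - 1;
    $hi = $#doc_ids if $hi > $#doc_ids;

    for my $doc_id ( @doc_ids[ $lo .. $hi ] ) {
        my $doc = $searcher->fetch_doc($doc_id);
        # ... do the expensive per-hit work here ...
    }
    $pm->finish;
}
$pm->wait_all_children;
```

Whether this wins depends on how expensive the per-hit work is relative to
the cost of forking and re-opening the index; if the children need to report
results back, that adds IPC overhead on top.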

Marvin Humphrey