Posted to solr-user@lucene.apache.org by Brian Yee <by...@wayfair.com> on 2018/02/01 18:55:50 UTC

External file fields

Hello,

I want to use an external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs.

What is the best way to keep this file updated quickly? Ideally I would like to read changes from a Kafka queue and write them to the file. But it seems like I would have to open the whole file, read it, find the line I want to change, and write the whole file back for every change. Is there a better way to do that? That approach seems like it would be difficult and slow if the file is several million lines long.

Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to the correct directory on all the different SolrCloud nodes?
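
One way to avoid rewriting the file for every single change is to keep the current values in memory, fold each batch of changes into that map, and rewrite the file once per batch. The sketch below is only an illustration under assumptions: the function names, the `sku` keys and the batch shape are hypothetical, and it assumes the usual `key=value` line format of an external file. Solr generally picks up a replaced file only when a new searcher is opened, so a commit (or a reload listener) would still be needed afterwards.

```python
import os
import tempfile


def apply_changes(values, changes):
    """Fold a batch of (key, value) changes into the in-memory map; last write wins."""
    for key, value in changes:
        values[key] = value
    return values


def write_external_file(values, path):
    """Write every key=value pair to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".external_tmp_")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for key in sorted(values):
                tmp.write(f"{key}={values[key]}\n")
        os.chmod(tmp_path, 0o644)   # mkstemp creates the file owner-only; make it readable
        os.replace(tmp_path, path)  # readers see either the old file or the new one
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With this shape, the cost of a full rewrite is paid once per batch rather than once per change, and a reader never sees a half-written file.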

Re: External file fields

Posted by Charlie Hull <ch...@flax.co.uk>.
On 01/02/2018 18:55, Brian Yee wrote:
> Hello,
> 
> I want to use external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs.
> 
> What is the best way to keep this file updated in a fast way. Ideally I would like to read changes from a Kafka queue and write to the file. But it seems like I would have to open the whole file, read the whole file, find the line I want to change, and write the whole file for every change. Is there a better way to do that? That approach seems like it would be difficult/slow if the file is several million lines long.
> 
> Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to all the different solrcloud nodes in the correct directory?
> 
Another approach would be the XJoin plugin we wrote - if you wait a few 
days we should have an updated patch for Solr v6.5 and possibly v7. 
XJoin lets you filter/join/rank Solr results using an external data source.

http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/
http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/


Cheers

Charlie


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

Re: External file fields

Posted by Emir Arnautović <em...@sematext.com>.
Hi Brian,
You should be able to sort on a field that has only docValues.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 2 Feb 2018, at 16:53, Brian Yee <by...@wayfair.com> wrote:
> 
> Hello Erick,
> 
> I did look into updatable docValues, but my understanding is that the field has to be non-indexed (indexed="false"). I need to be able to sort on these values. External file fields are sortable.
> https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates
> 
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com] 
> Sent: Thursday, February 1, 2018 5:00 PM
> To: solr-user <so...@lucene.apache.org>
> Subject: Re: External file fields
> 
> Have you considered updateable docValues?
> 
> Best,
> Erick
> 
> On Thu, Feb 1, 2018 at 10:55 AM, Brian Yee <by...@wayfair.com> wrote:
>> Hello,
>> 
>> I want to use external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs.
>> 
>> What is the best way to keep this file updated in a fast way. Ideally I would like to read changes from a Kafka queue and write to the file. But it seems like I would have to open the whole file, read the whole file, find the line I want to change, and write the whole file for every change. Is there a better way to do that? That approach seems like it would be difficult/slow if the file is several million lines long.
>> 
>> Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to all the different solrcloud nodes in the correct directory?


RE: External file fields

Posted by Chris Hostetter <ho...@fucit.org>.
: Interesting. I will definitely explore this. Just so I'm clear, we can 
: sort on docValues, but not filter? Is there any situation where external 
: file fields would work better than docValues?

For most field types that support docValues, you can still filter on the 
field even if it's indexed="false" -- but the filtering may not be as 
efficient as using indexed values.  For numeric fields you certainly can.

One situation where ExternalFileField would probably be preferable to doing 
in-place updates on docValues is when you know you need to update the value 
for *every* document in your collection in batch -- for large 
collections, looping over every doc and sending an atomic update would 
probably be slower than just replacing the external file.

Another example where I would probably choose external file field over 
docValues is if the "keyField" was not the same as my uniqueKey field ... 
ie: if I have millions of documents, each with a category_id that has a 
cardinality of ~100 categories.  I could use 
the category_id field as the keyField to associate every doc w/some 
numeric "category_rank" value (that varies only per category).  If I 
need/want to tweak 1 of those 100 category_rank values, updating the 
entire external file just to change that 1 value is still probably much 
easier than redundantly putting that category_rank field in every 
doc and sending an atomic update to all ~10K docs that have the 
category_id whose category_rank I want to change.
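
As a rough illustration of the category case above: with category_id as the keyField, the external file has only about 100 lines, so changing one rank is a small read-modify-write. This is a sketch under assumptions, not a description of any Solr API: the helper names, the `catNNN` ids and the rank values are all made up, and the `key=value` line format is assumed.

```python
def parse_external_file(text):
    """Parse key=value lines into a dict, skipping blank lines."""
    ranks = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition("=")
            ranks[key.strip()] = float(value)
    return ranks


def render_external_file(ranks):
    """Render the dict back to key=value lines, sorted by key."""
    return "".join(f"{key}={ranks[key]}\n" for key in sorted(ranks))


# Hypothetical data: 100 categories, one rank per category.
original = "".join(f"cat{i:03d}={float(i)}\n" for i in range(100))

ranks = parse_external_file(original)
ranks["cat042"] = 99.5  # tweak a single category's rank
updated = render_external_file(ranks)
```

Only one line of the 100-line file differs afterwards, whereas the docValues route would mean one atomic update per document in that category.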


: 
: -----Original Message-----
: From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
: Sent: Friday, February 2, 2018 12:24 PM
: To: solr-user@lucene.apache.org
: Subject: RE: External file fields
: 
: 
: : I did look into updatable docValues, but my understanding is that the
: : field has to be non-indexed (indexed="false"). I need to be able to sort
: : on these values. External file fields are sortable.
: 
: You can absolutely sort on a field that is docValues="true" 
: indexed="false" ... that is much more efficient than sorting on a field that is docValues="false" indexed="true" -- in the latter case Solr has to build a fieldcache (aka: run-time-mock-docvalues) from the indexed values the first time you try to sort on the field after a searcher is opened
: 
: 
: 
: -Hoss
: http://www.lucidworks.com/
: 

-Hoss
http://www.lucidworks.com/

RE: External file fields

Posted by Brian Yee <by...@wayfair.com>.
Interesting. I will definitely explore this. Just so I'm clear, we can sort on docValues, but not filter? Is there any situation where external file fields would work better than docValues?

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Friday, February 2, 2018 12:24 PM
To: solr-user@lucene.apache.org
Subject: RE: External file fields


: I did look into updatable docValues, but my understanding is that the
: field has to be non-indexed (indexed="false"). I need to be able to sort
: on these values. External file fields are sortable.

You can absolutely sort on a field that is docValues="true" 
indexed="false" ... that is much more efficient than sorting on a field that is docValues="false" indexed="true" -- in the latter case Solr has to build a fieldcache (aka: run-time-mock-docvalues) from the indexed values the first time you try to sort on the field after a searcher is opened



-Hoss
http://www.lucidworks.com/

RE: External file fields

Posted by Chris Hostetter <ho...@fucit.org>.
: I did look into updatable docValues, but my understanding is that the 
: field has to be non-indexed (indexed="false"). I need to be able to sort 
: on these values. External file fields are sortable.

You can absolutely sort on a field that is docValues="true" 
indexed="false" ... that is much more efficient than sorting on a field 
that is docValues="false" indexed="true" -- in the latter case Solr has to 
build a fieldcache (aka: run-time-mock-docvalues) from the indexed values 
the first time you try to sort on the field after a searcher is opened.
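
To make that concrete, here is a minimal sketch of what the update and the sort could look like for such a field. It only builds the request bodies and does not contact Solr. The field name `price`, the `sku` ids and the assumption that the uniqueKey is `id` are all hypothetical. For Solr to apply the update in place, the field is expected to be a single-valued numeric field with docValues="true", indexed="false" and stored="false".

```python
import json
from urllib.parse import urlencode


def in_place_update(doc_id, field, value):
    """Build one atomic-update document that sets a single field."""
    return {"id": doc_id, field: {"set": value}}


# Hypothetical batch of price changes, e.g. drained from a queue.
changes = [("sku1", 9.99), ("sku2", 5.0)]

# JSON body that would be POSTed to the collection's /update handler.
payload = json.dumps([in_place_update(doc_id, "price", v) for doc_id, v in changes])

# Query parameters that sort and filter on the same docValues-only field.
params = urlencode({"q": "*:*", "fq": "price:[5 TO 10]", "sort": "price asc"})
```

The same `price` field serves both the update and the sort, which is the point of the reply above: no indexed copy is needed for sorting.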



-Hoss
http://www.lucidworks.com/

RE: External file fields

Posted by Brian Yee <by...@wayfair.com>.
Hello Erick,

I did look into updatable docValues, but my understanding is that the field has to be non-indexed (indexed="false"). I need to be able to sort on these values. External file fields are sortable.
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Thursday, February 1, 2018 5:00 PM
To: solr-user <so...@lucene.apache.org>
Subject: Re: External file fields

Have you considered updateable docValues?

Best,
Erick

On Thu, Feb 1, 2018 at 10:55 AM, Brian Yee <by...@wayfair.com> wrote:
> Hello,
>
> I want to use external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs.
>
> What is the best way to keep this file updated in a fast way. Ideally I would like to read changes from a Kafka queue and write to the file. But it seems like I would have to open the whole file, read the whole file, find the line I want to change, and write the whole file for every change. Is there a better way to do that? That approach seems like it would be difficult/slow if the file is several million lines long.
>
> Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to all the different solrcloud nodes in the correct directory?

Re: External file fields

Posted by Erick Erickson <er...@gmail.com>.
Have you considered updateable docValues?

Best,
Erick

On Thu, Feb 1, 2018 at 10:55 AM, Brian Yee <by...@wayfair.com> wrote:
> Hello,
>
> I want to use external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs.
>
> What is the best way to keep this file updated in a fast way. Ideally I would like to read changes from a Kafka queue and write to the file. But it seems like I would have to open the whole file, read the whole file, find the line I want to change, and write the whole file for every change. Is there a better way to do that? That approach seems like it would be difficult/slow if the file is several million lines long.
>
> Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to all the different solrcloud nodes in the correct directory?

Re: External file fields

Posted by Emir Arnautović <em...@sematext.com>.
Maybe you can try or extend Sematext’s Redis parser: https://github.com/sematext/solr-redis. The downside of this approach is that it adds another moving part: Redis.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 1 Feb 2018, at 19:55, Brian Yee <by...@wayfair.com> wrote:
> 
> Hello,
> 
> I want to use external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs.
> 
> What is the best way to keep this file updated in a fast way. Ideally I would like to read changes from a Kafka queue and write to the file. But it seems like I would have to open the whole file, read the whole file, find the line I want to change, and write the whole file for every change. Is there a better way to do that? That approach seems like it would be difficult/slow if the file is several million lines long.
> 
> Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to all the different solrcloud nodes in the correct directory?