Posted to solr-user@lucene.apache.org by avenka <av...@gmail.com> on 2012/07/08 19:25:55 UTC
DataImport using last_indexed_id or getting max(id) quickly
My understanding is that the DIH in Solr only writes last_index_time to
dataimport.properties, but not, say, a last_indexed_id for a primary key 'id'.
How can I efficiently get max(id)? (Note that 'id' is an auto-increment
field in the database.) Maintaining max(id) outside of Solr is brittle, and
computing max(id) before each data import can take several minutes when the
index has several hundred million records.
How can I either import based on ID or get max(id) quickly? I cannot use
timestamp-based import because I get out-of-memory errors if/when Solr falls
behind, and the suggested fixes online did not work for me.
--
View this message in context: http://lucene.472066.n3.nabble.com/DataImport-using-last-indexed-id-or-getting-max-id-quickly-tp3993763.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: DataImport using last_indexed_id or getting max(id) quickly
Posted by Erick Erickson <er...@gmail.com>.
You could also just keep a "special" document in your index with a known
ID that contains meta-data fields. If this document had no fields in common
with any other document it wouldn't satisfy searches (except the *:* search).
Or you could store this info somewhere else (file, DB, etc).
Or you can commit with "user data", although this isn't exposed
through Solr yet, see:
https://issues.apache.org/jira/browse/SOLR-2701
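As a rough illustration of the "special document" idea, the sketch below only builds the Solr XML update message and the lookup query string; it does not talk to Solr. The known ID "0" (assumed to be unused by real rows), the field name meta_max_id_l, and the class itself are assumptions for illustration, and the field would have to exist in your schema. Re-posting a document with the same uniqueKey replaces the previous one, so there is always exactly one such document.

```java
public class MetaDocument {
    // Hypothetical reserved key; assumed never to be used by a real database row.
    static final String META_ID = "0";
    // Hypothetical field name; no other document should use this field.
    static final String MAX_ID_FIELD = "meta_max_id_l";

    // Builds the XML update message that stores the highest imported id.
    static String updateXml(long maxId) {
        return "<add><doc>"
                + "<field name=\"id\">" + META_ID + "</field>"
                + "<field name=\"" + MAX_ID_FIELD + "\">" + maxId + "</field>"
                + "</doc></add>";
    }

    // Builds the query parameters that fetch the stored value again.
    static String lookupQuery() {
        return "q=id:" + META_ID + "&fl=" + MAX_ID_FIELD;
    }

    public static void main(String[] args) {
        System.out.println(updateXml(123456789L));
        System.out.println(lookupQuery());
    }
}
```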
Best
Erick
On Thu, Jul 12, 2012 at 5:22 AM, <ka...@gmx.de> wrote:
> [...]
Re: DataImport using last_indexed_id or getting max(id) quickly
Posted by ka...@gmx.de.
Hi Avenka,
you asked for a how-to on adding a field "inverseID" that lets you compute max(id) from its first term:
If you compute the value outside Solr, calculate "100000000 - id" and store it in an extra field "inverseID".
If you index into Solr with your own code, add a TrieLongField "inverseID" and fill it with the value "-id".
If you only want to change schema.xml (and add some classes):
* You need a new FieldType "inverseLongType" and a Field "inverseID" of type "inverseLongType"
* You need a line <copyField source="id" dest="inverseID"/>
(see http://wiki.apache.org/solr/SchemaXml#Copy_Fields)
For inverseLongType I see two possibilities:
a) use TextField and write your own filter to calculate "100000000 - id"
b) extend TrieLongField into a new FieldType "InverseTrieLongField" with:
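import org.apache.lucene.document.Fieldable;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.TrieLongField;

// Targets the Solr 3.x API. Stores -id so that the first term corresponds to max(id).
public class InverseTrieLongField extends TrieLongField {
    @Override
    public String readableToIndexed(String val) {
        // Negate the readable value before it is converted to its indexed form.
        return super.readableToIndexed(Long.toString(-Long.parseLong(val)));
    }

    @Override
    public Fieldable createField(SchemaField field, String externalVal, float boost) {
        // Negate the external value before the field is created.
        return super.createField(field, Long.toString(-Long.parseLong(externalVal)), boost);
    }

    @Override
    public Object toObject(Fieldable f) {
        // Undo the negation so that callers see the original id.
        Object result = super.toObject(f);
        if (result instanceof Long) {
            return Long.valueOf(-((Long) result).longValue());
        }
        return result;
    }
}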
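To see why the first term of "inverseID" yields max(id), here is a minimal, Solr-free sketch. It assumes the index enumerates terms in ascending order, which a java.util.TreeSet imitates here; the class and method names are made up for illustration. It also covers the text variant: "100000000 - id" only sorts correctly as a string if the constant is larger than every id and the values are zero-padded to a fixed width (otherwise "10" would sort before "9").

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class InverseIdDemo {
    // Constant from the thread; it must be larger than every id that is ever indexed.
    static final long OFFSET = 100000000L;

    // Numeric variant: index -id, so the smallest term belongs to the largest id.
    static long maxIdFromNegated(List<Long> ids) {
        TreeSet<Long> terms = new TreeSet<>();
        for (long id : ids) {
            terms.add(-id);
        }
        return -terms.first();
    }

    // Text variant: index OFFSET - id, zero-padded so that string order equals numeric order.
    static long maxIdFromPadded(List<Long> ids) {
        TreeSet<String> terms = new TreeSet<>();
        for (long id : ids) {
            terms.add(String.format("%09d", OFFSET - id));
        }
        return OFFSET - Long.parseLong(terms.first());
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(7L, 42L, 99999L, 1234L);
        System.out.println(maxIdFromNegated(ids)); // prints 99999
        System.out.println(maxIdFromPadded(ids));  // prints 99999
    }
}
```

With several hundred million records, the constant from the thread would be too small for the text variant; the negated numeric variant does not have that limit.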
Best regards
Karsten
-------- Original Message --------
> Date: Wed, 11 Jul 2012 20:59:10 -0700 (PDT)
> From: avenka <av...@gmail.com>
> To: solr-user@lucene.apache.org
> Subject: Re: DataImport using last_indexed_id or getting max(id) quickly
> Thanks. Can you explain more the first TermsComponent option to obtain
> max(id)? Do I have to modify schema.xml to add a new field? How exactly do I
> query for the lowest value of "100000000 - id"?
Re: DataImport using last_indexed_id or getting max(id) quickly
Posted by avenka <av...@gmail.com>.
Thanks. Can you explain the first TermsComponent option for obtaining
max(id) in more detail? Do I have to modify schema.xml to add a new field?
How exactly do I query for the lowest value of "100000000 - id"?
Re: DataImport using last_indexed_id or getting max(id) quickly
Posted by ka...@gmx.de.
Hi Avenka,
*DataImportHandler*
1.) There is no configuration option to write the last uniqueKey field value to dataimport.properties.
2.) You can use LogUpdateProcessor, which logs every "schema.printableUniqueKey(doc)" via log.info( ""+toLog + " 0 " + (elapsed) ).
3.) You can write your own LogUpdateProcessor that logs only the last uniqueKey.
4.) You can change DocBuilder#execute to store the uniqueKey in dataimport.properties.
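Option 4 amounts to persisting the highest imported key in a properties file. The following is a rough sketch of that bookkeeping done outside DocBuilder with plain java.util.Properties. The key name last_indexed_id, the table name items, the LIMIT-style SQL, and the class itself are assumptions for illustration; none of this is something DIH provides.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class LastIndexedId {
    // Hypothetical key; DIH itself does not write it.
    static final String KEY = "last_indexed_id";

    // Loads the properties file if it exists, otherwise returns an empty set.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                props.load(in);
            }
        }
        return props;
    }

    // Returns the stored id, or 0 if nothing has been imported yet.
    static long read(Path file) throws IOException {
        return Long.parseLong(load(file).getProperty(KEY, "0"));
    }

    // Stores the new id and keeps any other keys already present in the file.
    static void write(Path file, long lastId) throws IOException {
        Properties props = load(file);
        props.setProperty(KEY, Long.toString(lastId));
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, null);
        }
    }

    // Builds a bounded, ID-based import query (hypothetical table name and SQL dialect).
    static String nextBatchQuery(long lastId, int batchSize) {
        return "SELECT * FROM items WHERE id > " + lastId
                + " ORDER BY id LIMIT " + batchSize;
    }
}
```

Importing in bounded batches keyed on the id keeps each run small, which may help with the out-of-memory problem described in the original question.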
*max(id)*
With TermsComponent you can easily ask for the first term in a field (so you could add a field with "10000000 - id" to find the last term in id).
With Solr 4.0, some index codecs will support "give me the last term" in a field: Fields#getUniqueTermCount() together with TermsEnum#seekExact(long).
With Solr 3.6 you can use TermsComponent together with a guessed "terms.lower" to find the last term in a field. This should outrun a "*:*" search with the function max(id).
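The "guess a terms.lower" approach can be sketched as follows. The first method builds a TermsComponent request using the terms.fl, terms.sort, terms.limit and terms.lower parameters; the base URL and the /terms handler path are assumptions about your setup. The second method simulates the paging against an in-memory sorted set instead of a live index: start from a guess (for example the last known max id), fetch a page of terms at or above it, and move the lower bound forward until a page comes back short. It assumes the field's terms are enumerated in the same order as the ids (for instance fixed-width, zero-padded strings).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class TermsLowerDemo {
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds the request URL; baseUrl and the "/terms" path are assumptions.
    static String termsUrl(String baseUrl, String field, String lower, int limit) {
        return baseUrl + "/terms?terms.fl=" + enc(field)
                + "&terms.sort=index&terms.limit=" + limit
                + "&terms.lower=" + enc(lower);
    }

    // Stand-in for one TermsComponent call: up to 'limit' terms >= lower, in index order.
    static List<Long> fetchTerms(NavigableSet<Long> index, long lower, int limit) {
        List<Long> page = new ArrayList<>();
        for (long term : index.tailSet(lower, true)) {
            if (page.size() == limit) {
                break;
            }
            page.add(term);
        }
        return page;
    }

    // Pages forward from the guess; returns -1 if no term is >= guess.
    static long lastTerm(NavigableSet<Long> index, long guess, int limit) {
        long lower = guess;
        long last = -1;
        while (true) {
            List<Long> page = fetchTerms(index, lower, limit);
            if (page.isEmpty()) {
                return last;
            }
            last = page.get(page.size() - 1);
            if (page.size() < limit) {
                return last;
            }
            lower = last + 1;
        }
    }

    public static void main(String[] args) {
        NavigableSet<Long> index = new TreeSet<>();
        for (long id = 1; id <= 1000; id++) {
            index.add(id);
        }
        System.out.println(lastTerm(index, 900, 10)); // prints 1000
        System.out.println(termsUrl("http://localhost:8983/solr", "id", "900", 10));
    }
}
```

The closer the guess is to the real maximum, the fewer requests are needed; a guess above the maximum returns nothing, so it then has to be lowered and retried.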
Best regards
Karsten