Posted to solr-user@lucene.apache.org by avenka <av...@gmail.com> on 2012/07/08 19:25:55 UTC

DataImport using last_indexed_id or getting max(id) quickly

My understanding is that the DIH in Solr records only last_index_time in
dataimport.properties, and not, say, a last_indexed_id for a primary key 'id'.
How can I efficiently get max(id)? (Note that 'id' is an auto-increment
field in the database.) Maintaining max(id) outside of Solr is brittle, and
computing max(id) before each dataimport can take several minutes when the
index has several hundred million records.

How can I either import based on ID or get max(id) quickly? I cannot use
timestamp-based import because I get out-of-memory errors if/when Solr falls
behind, and the suggested fixes I found online did not work for me.

--
View this message in context: http://lucene.472066.n3.nabble.com/DataImport-using-last-indexed-id-or-getting-max-id-quickly-tp3993763.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: DataImport using last_indexed_id or getting max(id) quickly

Posted by Erick Erickson <er...@gmail.com>.
You could also just keep a "special" document in your index with a known
ID that contains meta-data fields. If this document had no fields in common
with any other document it wouldn't satisfy searches (except the *:* search).

Or you could store this info somewhere else (file, DB, etc).

Or you can commit with "user data", although this isn't exposed
through Solr yet, see:
https://issues.apache.org/jira/browse/SOLR-2701

Best
Erick

On Thu, Jul 12, 2012 at 5:22 AM,  <ka...@gmx.de> wrote:
> [...]

Re: DataImport using last_indexed_id or getting max(id) quickly

Posted by ka...@gmx.de.
Hi Avenka,

you asked for a how-to for adding a field "inverseID" whose first term lets you calculate max(id):
If you do not use Solr, you have to calculate "100000000 - id" yourself and store it in an extra field "inverseID".
If you feed Solr from your own code, add a TrieLongField "inverseID" and fill it with the value "-id".
If you only want to change schema.xml (and add some classes):
  * You need a new FieldType "inverseLongType" and a Field "inverseID" of type "inverseLongType"
  * You need a line <copyField source="id" dest="inverseID"/>
   (see http://wiki.apache.org/solr/SchemaXml#Copy_Fields)

For inverseLongType I see two possibilities:
 a) use TextField and write your own filter to calculate "100000000 - id"
 b) extend TrieLongField into a new FieldType "InverseTrieLongField" with:
  // Needs org.apache.lucene.document.Fieldable and org.apache.solr.schema.SchemaField (Solr 3.x).
  // Index the negated value, so the smallest indexed term belongs to the largest id.
  @Override
  public String readableToIndexed(String val) {
    return super.readableToIndexed(Long.toString(-Long.parseLong(val)));
  }
  // Negate the external value before the field is created.
  @Override
  public Fieldable createField(SchemaField field, String externalVal, float boost) {
    return super.createField(field, Long.toString(-Long.parseLong(externalVal)), boost);
  }
  // Undo the negation when the value is read back.
  @Override
  public Object toObject(Fieldable f) {
    Object result = super.toObject(f);
    if (result instanceof Long) {
      return Long.valueOf(-((Long) result).longValue());
    }
    return result;
  }
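
The idea behind option (a) can be illustrated with a minimal sketch in plain Java. It does not call Solr or Lucene: a TreeSet stands in for the sorted terms of the "inverseID" field, and the class name, the CEILING constant and the sample ids are assumptions made up for this illustration. Because string terms sort character by character, the sketch zero-pads "CEILING - id" to a fixed width so that the first term is also the numerically smallest one.

```java
import java.util.Locale;
import java.util.TreeSet;

// Illustration only: a TreeSet stands in for the sorted term dictionary of
// the "inverseID" field. No Solr or Lucene classes are involved.
class InverseIdSketch {
    // Assumed ceiling (hypothetical); it must exceed every id ever indexed.
    static final long CEILING = 1_000_000_000L;

    // Map an id to a fixed-width term so that string order equals numeric order.
    static String toInverseTerm(long id) {
        if (id < 0 || id >= CEILING) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        return String.format(Locale.ROOT, "%010d", CEILING - id);
    }

    // The first (smallest) inverse term corresponds to the largest id.
    static long maxIdFromFirstTerm(String firstTerm) {
        return CEILING - Long.parseLong(firstTerm);
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        for (long id : new long[] {7L, 42L, 1234L, 999L}) {
            terms.add(toInverseTerm(id));
        }
        // The first term in sorted order recovers the maximum id.
        System.out.println(maxIdFromFirstTerm(terms.first()));
    }
}
```

With the sample ids above, the first term is the one derived from 1234, so the sketch prints 1234. Option (b) avoids the padding question because the trie field orders its values numerically.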

Best regards
   Karsten


-------- Original Message --------
> Date: Wed, 11 Jul 2012 20:59:10 -0700 (PDT)
> From: avenka <av...@gmail.com>
> To: solr-user@lucene.apache.org
> Subject: Re: DataImport using last_indexed_id or getting max(id) quickly

> Thanks. Can you explain more the first TermsComponent option to obtain
> max(id)? Do I have to modify schema.xml to add a new field? How exactly do I
> query for the lowest value of "100000000 - id"?

Re: DataImport using last_indexed_id or getting max(id) quickly

Posted by avenka <av...@gmail.com>.
Thanks. Can you explain more the first TermsComponent option to obtain
max(id)? Do I have to modify schema.xml to add a new field? How exactly do I
query for the lowest value of "100000000 - id"?


Re: DataImport using last_indexed_id or getting max(id) quickly

Posted by ka...@gmx.de.
Hi Avenka,

*DataImportHandler*
1.) There is no configuration option to add the last uniqueKey field value to dataimport.properties.
2.) You can use LogUpdateProcessor to log every "schema.printableUniqueKey(doc)" via log.info( ""+toLog + " 0 " + (elapsed) ).
3.) You can write your own LogUpdateProcessor that logs only the last uniqueKey.
4.) You can change DocBuilder#execute to store the uniqueKey in dataimport.properties.

*max(id)*
With TermsComponent you can easily ask for the first term in a field (so you could add a field containing "10000000 - id", whose first term corresponds to the last term in id).
With Solr 4.0, some index codecs will support "give me the last term" in a field: Fields#getUniqueTermCount() together with TermsEnum#seekExact(long).
With Solr 3.6 you can use TermsComponent together with a guessed "terms.lower" to find the last term in a field. This should outperform a "*:*" search with the function max(id).
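
As a rough sketch of what such TermsComponent requests could look like, the following plain-Java helper only builds the request URLs; it does not contact Solr. It assumes a request handler registered at /terms with the TermsComponent enabled (as in the example solrconfig.xml), and the base URL, the class name and the method names are hypothetical. The parameters terms.fl, terms.lower, terms.sort and terms.limit are the TermsComponent's own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds TermsComponent request URLs; sending them is left to the caller.
class TermsRequestSketch {
    // URL-encode a parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Request only the first term of a field, in index order.
    static String firstTermUrl(String solrBase, String field) {
        return solrBase + "/terms?terms=true&terms.fl=" + enc(field)
                + "&terms.sort=index&terms.limit=1";
    }

    // Request up to 'limit' terms starting at a guessed lower bound.
    static String termsFromUrl(String solrBase, String field,
                               String lowerGuess, int limit) {
        return solrBase + "/terms?terms=true&terms.fl=" + enc(field)
                + "&terms.lower=" + enc(lowerGuess)
                + "&terms.sort=index&terms.limit=" + limit;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // hypothetical base URL
        System.out.println(firstTermUrl(base, "inverseID"));
        System.out.println(termsFromUrl(base, "id", "999000000", 100));
    }
}
```

For the terms.lower variant, the guess has to lie below the real maximum: if the response comes back empty, the guess was too high; if it is cut off at the limit, the guess was too low and can be raised.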

Best regards
  Karsten

