You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by AJ Weber <aw...@comcast.net> on 2008/04/15 16:26:30 UTC

replace field in doc?

I'm curious how people are building the "all" Field (for searching "all of the terms at once").

I understand using store=NO, Index=Tokenized is generally the way to add the field, but what if I need to basically use multiple classes to build my Document before adding it to the index (keeping things modular and reusable)?

If one class adds the Field (i.e. doc.add("all", sCurrentString, Field.Store.NO, Field.Index.TOKENIZED) ), then I pass the document ("doc" in this example) to another class, how do I basically concat any additional field-values/strings to the existing text-Field?  This is all before the Document is added to the index in the first-place.

I'm concerned that doing a doc.get("all") and then adding it back to the doc with the additional string will result in two fields of the name "all" (I seem to have read this is the functionality somewhere)???

Thanks in advance!
-AJ

Re: replace field in doc?

Posted by AJ Weber <aw...@comcast.net>.
Characters or "terms"?  (And btw: what's the difference?)  The javadoc says 10,000 "terms", which I assume generally equates to "words" (and given that the analyzer might use stemming, stop words, etc.).

Great info.  Thanks again!

-AJ
  ----- Original Message ----- 
  From: Erick Erickson 
  To: java-user@lucene.apache.org ; AJ Weber 
  Sent: Tuesday, April 15, 2008 2:50 PM
  Subject: Re: replace field in doc?


  Well, "my way" would certainly be simpler to read six months from
  now when you look at this code again <G>........

  And I'm quite sure you can add the same field multiple times, so 
  whatever you want.

  Do note, though, that Lucene defaults to 10,000 characters in any
  single field no matter which way you build it up, so you might need to
  be aware of IndexWriter.setMaxFieldLength().

  Good Luck!
  Erick



  On Tue, Apr 15, 2008 at 1:39 PM, AJ Weber <aw...@comcast.net> wrote:

    I ended up doing this:

                               String docText = doc.get("body");
                               Field fCurAll = doc.getField("all");
                               if ((fCurAll != null) && (docText != null)) {
                                   String newAll = fCurAll.stringValue() + " " + docText;
                                   logCat.debug("Updating 'all' field with props + bodyText");
                                   fCurAll.setValue(newAll);
                               }
                               else if ((fCurAll == null) && (docText != null))
                                   doc.add(new Field("all", docText, Field.Store.NO, Field.Index.TOKENIZED));

    But I suppose if what you say is true, it would be more efficient to do it "your way", because I don't need to chew-up memory and CPU by extracting the existing info, concat it, then put it back, right?

    (Since this is the "all" field, for 'global searching', and I'm not storing it anyway, I'm not concerned about the positions too much.)

    Thanks a bunch for the quick reply!
    -AJ

     ----- Original Message -----
     From: Erick Erickson
     To: java-user@lucene.apache.org ; AJ Weber
     Sent: Tuesday, April 15, 2008 1:12 PM
     Subject: Re: replace field in doc?


     You can freely add the same field (with different text) to a doc. For
     instance
     Document doc = new Document();
     doc.add("field", "this is the first");
     doc.add("field", "starting the second ");
     IndexWriter.addDocument(doc)

     is functionally the same as


     Document doc = new Document();
     doc.add("field", "this is the first starting the second");
     IndexWriter.addDocument(doc)


     with one subtle difference. The first version will add
     whatever is returned from Analyzer.getPositionIncrementGap
     to the term position of the first token in adds 2-n. That is,
     say getPositionIncrementGap returned 100. Then
     the token "first" would have position 3 and "starting"
     would have position 103 (I may be off by one on both these,
     but you get the idea).

     The default return is 1, so if you do nothing special, the two
     calls will be identical.

     Why would you return something other than 1 from
     getPositionIncrementGap? Well, if you want to dance
     fancy and, say, NOT span across the two lines above
     for, say, SpanNear clauses, you could return a large
     number that was bigger than the max allowable span.

     Best
     Erick

     On Tue, Apr 15, 2008 at 10:26 AM, AJ Weber <aw...@comcast.net> wrote:

     > I'm curious how people are building the "all" Field (for searching "all of
     > the terms at once").
     >
     > I understand using store=NO, Index=Tokenized is generally the way to add
     > the field, but what if I need to basically use multiple classes to build my
     > Document before adding it to the index (keeping things modular and
     > reusable)?
     >
     > If one class adds the Field (i.e. doc.add("all", sCurrentString,
     > Field.Store.NO, Field.Index.TOKENIZED) ), then I pass the document ("doc"
     > in this example) to another class, how do I basically concat any additional
     > field-values/strings to the existing text-Field?  This is all before the
     > Document is added to the index in the first-place.
     >
     > I'm concerned that doing a doc.get("all") and then adding it back to the
     > doc with the additional string will result in two fields of the name "all"
     > (I seem to have read this is the functionality somewhere)???
     >
     > Thanks in advance!
     > -AJ



Re: replace field in doc?

Posted by Erick Erickson <er...@gmail.com>.
Well, "my way" would certainly be simpler to read six months from
now when you look at this code again <G>........

And I'm quite sure you can add the same field multiple times, so
whatever you want.

Do note, though, that Lucene defaults to 10,000 characters in any
single field no matter which way you build it up, so you might need to
be aware of IndexWriter.setMaxFieldLength().

Good Luck!
Erick


On Tue, Apr 15, 2008 at 1:39 PM, AJ Weber <aw...@comcast.net> wrote:

> I ended up doing this:
>
>                            String docText = doc.get("body");
>                            Field fCurAll = doc.getField("all");
>                            if ((fCurAll != null) && (docText != null)) {
>                                String newAll = fCurAll.stringValue() + " "
> + docText;
>                                logCat.debug("Updating 'all' field with
> props + bodyText");
>                                fCurAll.setValue(newAll);
>                            }
>                            else if ((fCurAll == null) && (docText !=
> null))
>                                doc.add(new Field("all", docText,
> Field.Store.NO, Field.Index.TOKENIZED));
>
> But I suppose if what you say is true, it would be more efficient to do it
> "your way", because I don't need to chew-up memory and CPU by extracting the
> existing info, concat it, then put it back, right?
>
> (Since this is the "all" field, for 'global searching', and I'm not
> storing it anyway, I'm not concerned about the positions too much.)
>
> Thanks a bunch for the quick reply!
> -AJ
>   ----- Original Message -----
>  From: Erick Erickson
>  To: java-user@lucene.apache.org ; AJ Weber
>  Sent: Tuesday, April 15, 2008 1:12 PM
>  Subject: Re: replace field in doc?
>
>
>  You can freely add the same field (with different text) to a doc. For
>  instance
>  Document doc = new Document();
>  doc.add("field", "this is the first");
>  doc.add("field", "starting the second ");
>  IndexWriter.addDocument(doc)
>
>  is functionally the same as
>
>
>  Document doc = new Document();
>  doc.add("field", "this is the first starting the second");
>  IndexWriter.addDocument(doc)
>
>
>  with one subtle difference. The first version will add
>  whatever is returned from Analyzer.getPositionIncrementGap
>  to the term position of the first token in adds 2-n. That is,
>  say getPositionIncrementGap returned 100. Then
>  the token "first" would have position 3 and "starting"
>  would have position 103 (I may be off by one on both these,
>  but you get the idea).
>
>  The default return is 1, so if you do nothing special, the two
>  calls will be identical.
>
>  Why would you return something other than 1 from
>  getPositionIncrementGap? Well, if you want to dance
>  fancy and, say, NOT span across the two lines above
>  for, say, SpanNear clauses, you could return a large
>  number that was bigger than the max allowable span.
>
>  Best
>  Erick
>
>  On Tue, Apr 15, 2008 at 10:26 AM, AJ Weber <aw...@comcast.net> wrote:
>
>  > I'm curious how people are building the "all" Field (for searching "all
> of
>  > the terms at once").
>  >
>  > I understand using store=NO, Index=Tokenized is generally the way to
> add
>  > the field, but what if I need to basically use multiple classes to
> build my
>  > Document before adding it to the index (keeping things modular and
>  > reusable)?
>  >
>  > If one class adds the Field (i.e. doc.add("all", sCurrentString,
>  > Field.Store.NO, Field.Index.TOKENIZED) ), then I pass the document
> ("doc"
>  > in this example) to another class, how do I basically concat any
> additional
>  > field-values/strings to the existing text-Field?  This is all before
> the
>  > Document is added to the index in the first-place.
>  >
>  > I'm concerned that doing a doc.get("all") and then adding it back to
> the
>  > doc with the additional string will result in two fields of the name
> "all"
>  > (I seem to have read this is the functionality somewhere)???
>  >
>  > Thanks in advance!
>  > -AJ
>

Re: replace field in doc?

Posted by AJ Weber <aw...@comcast.net>.
I ended up doing this:

                            String docText = doc.get("body");
                            Field fCurAll = doc.getField("all");
                            if ((fCurAll != null) && (docText != null)) {
                                String newAll = fCurAll.stringValue() + " " + docText;
                                logCat.debug("Updating 'all' field with props + bodyText");
                                fCurAll.setValue(newAll);
                            }
                            else if ((fCurAll == null) && (docText != null))
                                doc.add(new Field("all", docText, Field.Store.NO, Field.Index.TOKENIZED));

But I suppose if what you say is true, it would be more efficient to do it "your way", because I don't need to chew-up memory and CPU by extracting the existing info, concat it, then put it back, right?

(Since this is the "all" field, for 'global searching', and I'm not storing it anyway, I'm not concerned about the positions too much.)

Thanks a bunch for the quick reply!
-AJ
  ----- Original Message ----- 
  From: Erick Erickson 
  To: java-user@lucene.apache.org ; AJ Weber 
  Sent: Tuesday, April 15, 2008 1:12 PM
  Subject: Re: replace field in doc?


  You can freely add the same field (with different text) to a doc. For
  instance
  Document doc = new Document();
  doc.add("field", "this is the first");
  doc.add("field", "starting the second ");
  IndexWriter.addDocument(doc)

  is functionally the same as


  Document doc = new Document();
  doc.add("field", "this is the first starting the second");
  IndexWriter.addDocument(doc)


  with one subtle difference. The first version will add
  whatever is returned from Analyzer.getPositionIncrementGap
  to the term position of the first token in adds 2-n. That is,
  say getPositionIncrementGap returned 100. Then
  the token "first" would have position 3 and "starting"
  would have position 103 (I may be off by one on both these,
  but you get the idea).

  The default return is 1, so if you do nothing special, the two
  calls will be identical.

  Why would you return something other than 1 from
  getPositionIncrementGap? Well, if you want to dance
  fancy and, say, NOT span across the two lines above
  for, say, SpanNear clauses, you could return a large
  number that was bigger than the max allowable span.

  Best
  Erick

  On Tue, Apr 15, 2008 at 10:26 AM, AJ Weber <aw...@comcast.net> wrote:

  > I'm curious how people are building the "all" Field (for searching "all of
  > the terms at once").
  >
  > I understand using store=NO, Index=Tokenized is generally the way to add
  > the field, but what if I need to basically use multiple classes to build my
  > Document before adding it to the index (keeping things modular and
  > reusable)?
  >
  > If one class adds the Field (i.e. doc.add("all", sCurrentString,
  > Field.Store.NO, Field.Index.TOKENIZED) ), then I pass the document ("doc"
  > in this example) to another class, how do I basically concat any additional
  > field-values/strings to the existing text-Field?  This is all before the
  > Document is added to the index in the first-place.
  >
  > I'm concerned that doing a doc.get("all") and then adding it back to the
  > doc with the additional string will result in two fields of the name "all"
  > (I seem to have read this is the functionality somewhere)???
  >
  > Thanks in advance!
  > -AJ

Re: replace field in doc?

Posted by Erick Erickson <er...@gmail.com>.
You can freely add the same field (with different text) to a doc. For
instance
Document doc = new Document();
doc.add("field", "this is the first");
doc.add("field", "starting the second ");
IndexWriter.addDocument(doc)

is functionally the same as


Document doc = new Document();
doc.add("field", "this is the first starting the second");
IndexWriter.addDocument(doc)


with one subtle difference. The first version will add
whatever is returned from Analyzer.getPositionIncrementGap
to the term position of the first token in adds 2-n. That is,
say getPositionIncrementGap returned 100. Then
the token "first" would have position 3 and "starting"
would have position 103 (I may be off by one on both these,
but you get the idea).

The default return is 1, so if you do nothing special, the two
calls will be identical.

Why would you return something other than 1 from
getPositionIncrementGap? Well, if you want to dance
fancy and, say, NOT span across the two lines above
for, say, SpanNear clauses, you could return a large
number that was bigger than the max allowable span.

Best
Erick

On Tue, Apr 15, 2008 at 10:26 AM, AJ Weber <aw...@comcast.net> wrote:

> I'm curious how people are building the "all" Field (for searching "all of
> the terms at once").
>
> I understand using store=NO, Index=Tokenized is generally the way to add
> the field, but what if I need to basically use multiple classes to build my
> Document before adding it to the index (keeping things modular and
> reusable)?
>
> If one class adds the Field (i.e. doc.add("all", sCurrentString,
> Field.Store.NO, Field.Index.TOKENIZED) ), then I pass the document ("doc"
> in this example) to another class, how do I basically concat any additional
> field-values/strings to the existing text-Field?  This is all before the
> Document is added to the index in the first-place.
>
> I'm concerned that doing a doc.get("all") and then adding it back to the
> doc with the additional string will result in two fields of the name "all"
> (I seem to have read this is the functionality somewhere)???
>
> Thanks in advance!
> -AJ