
Posted to dev@nutch.apache.org by EM <em...@cpuedge.com> on 2005/08/12 08:14:11 UTC

Field.Text vs Field.UnStored

I need some help figuring out the following:

I was looking at: BasicIndexingFilter.java where it's stated:

// url is both stored and indexed, so it's both searchable and returned
doc.add(Field.Text("url", url));

// content is indexed, so that it's searchable, but not stored in index
doc.add(Field.UnStored("content", parse.getText()));

I'm stuck on what replacement can be made here. I'm assuming doc.add is the
object that would add tokens to the index? How can a token (word, phrase) be
"searchable but not stored in the index"?

I'm basically trying to do the following, given two pages A and B:
A is written in an eastern alphabet.
B is written in the Latin alphabet.
I would like to index page B as it is, and page A as it is, plus the content
of page A translated to Latin in addition to it.

Would I have to add something like:
String content = parse.getText();
content += " ";
content += myTranslationFunctionToLatin(content);
doc.add(Field.Text("content", content));

Or would the last line be:
doc.add(Field.UnStored("content", content));

What's the difference with regard to the Field.* object?


Regards,
EM

Re: Site Content not indexed ? Nutch 0.7

Posted by Andrzej Bialecki <ab...@getopt.org>.

Nils Hoeller wrote:
> Hi,
> 
> actually I thought the content of the pages
> is being indexed.
> 
> When I have a look with Luke at the 
> index of a Nutch Crawl, it says 
> contents not available. 

Please try the "Reconstruct & Edit" button, and you should see some text 
from the content. The plain text is NOT stored in the Lucene index, it's 
just indexed there - the text itself is stored in the segment's parse_text.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Site Content not indexed ? Nutch 0.7

Posted by Nils Hoeller <ni...@arcor.de>.

Hi,

actually I thought the content of the pages
is being indexed.

When I have a look with Luke at the 
index of a Nutch Crawl, it says 
contents not available. 

When I search for a word in field "content"
that IS IN A SITE in the index, 
it gives me no results. 

Now I saw something in the config files saying
that contents is not yet being indexed!?

What's correct? Is it my fault? Do 
I have to enable some feature of the crawl 
to index the contents?
Is the contents field really not available? 


Thanks for your help.

Nils

Re: Field.Text vs Field.UnStored

Posted by Matthias Jaekle <ja...@eventax.de>.

> I'm assuming doc.add is the
> object that would add tokens to the index? 
Sometimes.

> How can a token (word, phrase) be
> "searchable but not stored in the index"?
Impossible.

You can only search what is in the index. But you cannot reconstruct the 
page content from your index.
If you want to be able to get parts of the original content back, you also 
have to store the page.

So: index the parts you would like to search, and store the parts you would 
like to get back out of your system in their original form. Or do both.

If you want to search special fields, you should not extend the content 
field, you should create a new field.

Maybe it is better to have a look at the index-more plugin instead of the 
basic indexing filter.

Matthias
-- 
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
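
To make the indexed-versus-stored distinction discussed above concrete, here
is a toy sketch in plain Java. It is not Lucene or Nutch code: ToyIndex,
addField, and the "content_latin" field name are all hypothetical, and the
Latin text is hard-coded rather than produced by a real translation function.
It only illustrates that a field's terms can be searchable while its original
text cannot be read back, and that translated text can go into its own field
instead of being appended to "content".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model for illustration only; not the Lucene or Nutch API. */
final class ToyIndex {
    // field name -> term -> ids of the documents that contain the term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();
    // document id -> field name -> original text, kept only for stored fields
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Registers an empty document and returns its id. */
    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    /** Adds a field; 'index' makes it searchable, 'store' keeps the original text. */
    void addField(int docId, String field, String text, boolean index, boolean store) {
        if (index) {
            Map<String, Set<Integer>> terms =
                    postings.computeIfAbsent(field, f -> new HashMap<>());
            // Naive tokenizer: lower-case, split on whitespace.
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    terms.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        if (store) {
            stored.get(docId).put(field, text);
        }
    }

    /** Returns the ids of documents whose given field contains the term. */
    Set<Integer> search(String field, String term) {
        return postings
                .getOrDefault(field, Collections.emptyMap())
                .getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** Returns the original text of a stored field, or null if it was not stored. */
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        int doc = idx.newDocument();
        // Indexed and stored, like the "url" field in the original question.
        idx.addField(doc, "url", "http://example.com/a", true, true);
        // Indexed but not stored; the word is "privet" in Cyrillic, as Unicode escapes.
        idx.addField(doc, "content", "\u043f\u0440\u0438\u0432\u0435\u0442", true, false);
        // Hypothetical separate field holding the Latin form (hard-coded here).
        idx.addField(doc, "content_latin", "privet", true, false);

        System.out.println(idx.search("content_latin", "privet"));  // [0]
        System.out.println(idx.storedValue(doc, "content_latin"));  // null
        System.out.println(idx.storedValue(doc, "url"));            // http://example.com/a
    }
}
```

In this sketch the term "privet" finds document 0 through the postings, yet
asking for the original text of "content_latin" returns null, because only
the terms were kept. That is the sense in which a field is searchable but
not stored.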

Re: [Nutch-dev] Field.Text vs Field.UnStored

Posted by praveen pathiyil <pa...@gmail.com>.

Hi,

You have four different options for field types

Field method/type                  Tokenized   Indexed   Stored

Field.Keyword(String, String)      No          Yes       Yes
Field.UnIndexed(String, String)    No          No        Yes
Field.UnStored(String, String)     Yes         Yes       No
Field.Text(String, String)         Yes         Yes       Yes
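
As a rough aid to reading the table, the three flags can be modelled as a
small Java enum. This is a hypothetical illustration, not part of the Lucene
API: FieldKind and forFullText are made-up names, and the flag values are
simply copied from the table above.

```java
/**
 * Hypothetical model of the four field types listed above.
 * Illustration only: this enum is not part of the Lucene API.
 */
enum FieldKind {
    KEYWORD(false, true, true),
    UNINDEXED(false, false, true),
    UNSTORED(true, true, false),
    TEXT(true, true, true);

    final boolean tokenized;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean tokenized, boolean indexed, boolean stored) {
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    /**
     * Picks a kind for free text, given whether it must be searchable
     * and whether the original text must be retrievable.
     */
    static FieldKind forFullText(boolean searchable, boolean retrievable) {
        if (searchable && retrievable) {
            return TEXT;
        }
        if (searchable) {
            return UNSTORED;
        }
        if (retrievable) {
            return UNINDEXED;
        }
        throw new IllegalArgumentException(
                "a field must be searchable, retrievable, or both");
    }

    public static void main(String[] args) {
        for (FieldKind kind : values()) {
            System.out.println(kind
                    + " tokenized=" + kind.tokenized
                    + " indexed=" + kind.indexed
                    + " stored=" + kind.stored);
        }
        // Page text that only needs to be searchable maps to UNSTORED.
        System.out.println(forFullText(true, false));
    }
}
```

Read this way, the choice between Field.Text and Field.UnStored in the
original question comes down to the Stored column: both are tokenized and
indexed, and only Field.Text also keeps the original string in the index.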
 
Check out Otis' introductory article for a background on this:
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=1

Regards,
Praveen.


