Posted to user@couchdb.apache.org by Rory Franklin <ro...@chillibean.tv> on 2011/09/03 17:06:43 UTC

couchdb-lucene indexing issues

 I'm using couchdb-lucene to index a list of fields (user defined) in a document using the following design document:

{
  "_id": "_design/foo",
  "_rev": "16-dcd0d39369c35b3d74ceef13a388826f",
  "fulltext": {
    "by_metadata": {
      "index": "function(doc) {
        var ret = new Document();
        if (doc['type'] == 'CSAsset' && doc['deleted'] != true) {
          for (var i in doc.metadata) {
            if (doc.metadata[i]['key'] == 'Title') {
              ret.add(doc.metadata[i]['value'].toLowerCase(), {'field': 'sort_title', 'store': 'yes', 'index': 'not_analyzed'});
            }
            ret.add(doc.metadata[i]['value'], {'field': doc.metadata[i]['key'].toLowerCase()});
            ret.add(doc.metadata[i]['value']);
          }
          for (var i in doc.partitions) {
            ret.add(doc.partitions[i].partition_id, {'field': 'partition'});
            ret.add(doc.partitions[i].partition_id);
          }
          ret.add(doc['created_at'], {'field': 'sort_created_at', 'store': 'yes', 'index': 'not_analyzed'});
          return ret;
        } else {
          return null;
        }
      }"
    }
  }
}



(I've formatted the definition so that it's not all on one line for readability here)

However, when using the by_metadata view it doesn't appear to be breaking the values up when there are underscores. For instance, searching for the term "wonderland" should return back a document where there is a field with the value "some_wonderland_example" but it doesn't. It only returns the document if I search for the full term.

I'm just wondering whether I'm defining the index incorrectly? (of course, feel free to point out if I'm doing anything else glaringly obviously wrong too!)



Rory

Re: couchdb-lucene indexing issues

Posted by Rory Franklin <ro...@chillibean.tv>.

Got it. I understand how that works now, and my search is returning the correct results. Thanks again!

--

Rory


Re: couchdb-lucene indexing issues

Posted by Robert Newson <rn...@apache.org>.

The analyzer setting is a top-level item, as documented in the README here:

https://github.com/rnewson/couchdb-lucene

B.
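
For anyone following along, here is a minimal sketch of what that placement might look like. It assumes "top-level" means a sibling of "index" inside the named index object (by_metadata), rather than an option passed to ret.add(); check the README for the authoritative layout. The index function is a shortened stand-in for the one in this thread, and the names indexFn, designDoc and designDocJson are made up for illustration.

```javascript
// Sketch only. The placement of "analyzer" is an assumption based on the
// reply above: a sibling of "index", not an option passed to ret.add().

// Shortened stand-in for the thread's index function. It is only serialized
// here, never called, so couchdb-lucene's Document class need not exist.
const indexFn = function (doc) {
  var ret = new Document();
  if (doc['type'] == 'CSAsset' && doc['deleted'] != true) {
    for (var i in doc.metadata) {
      ret.add(doc.metadata[i]['value'], {'field': doc.metadata[i]['key'].toLowerCase()});
      ret.add(doc.metadata[i]['value']);
    }
    return ret;
  }
  return null;
};

const designDoc = {
  _id: '_design/foo',
  fulltext: {
    by_metadata: {
      analyzer: 'simple',        // set once for the whole index
      index: indexFn.toString()  // function source as a single JSON string
    }
  }
};

// JSON.stringify escapes the newlines, so the function can stay readable
// in source and still be stored as valid JSON.
const designDocJson = JSON.stringify(designDoc, null, 2);
console.log(designDocJson);
```

Building the document this way also sidesteps hand-formatting the function onto one line. The existing _rev would still need to be supplied when updating a design document that is already in the database.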


Re: couchdb-lucene indexing issues

Posted by Rory Franklin <ro...@chillibean.tv>.

I've modified my original index in CouchDB to the following, but I'm still not having any joy getting the values broken up into tokens:


{
  "_id": "_design/foo",
  "_rev": "19-da99913ce4cdd421903d0d48f9a40cc3",
  "fulltext": {
    "by_metadata": {
      "index": "function(doc) {
        var ret = new Document();
        if (doc['type'] == 'CSAsset' && doc['deleted'] != true) {
          for (var i in doc.metadata) {
            if (doc.metadata[i]['key'] == 'Title') {
              ret.add(doc.metadata[i]['value'].toLowerCase(), {'field': 'sort_title', 'store': 'yes', 'index': 'not_analyzed'});
            }
            ret.add(doc.metadata[i]['value'], {'field': doc.metadata[i]['key'].toLowerCase(), 'analyzer': 'simple'});
            ret.add(doc.metadata[i]['value'], {'analyzer': 'simple'});
          }
          for (var i in doc.partitions) {
            ret.add(doc.partitions[i].partition_id, {'field': 'partition'});
            ret.add(doc.partitions[i].partition_id);
          }
          ret.add(doc['created_at'], {'field': 'sort_created_at', 'store': 'yes', 'index': 'not_analyzed'});
          return ret;
        } else {
          return null;
        }
      }"
    }
  }
}

I've opened the index in Luke. Going to the Documents tab and doing "Reconstruct & Edit" on a particular document shows that the fields aren't being split into separate tokens.


--

Rory


Re: couchdb-lucene indexing issues

Posted by Rory Franklin <ro...@chillibean.tv>.

Excellent, that was the simple mistake I was making! I thought standard broke it up into tokens.

Rory


Re: couchdb-lucene indexing issues

Posted by Robert Newson <rn...@apache.org>.

" For instance, searching for the term "wonderland" should return back
a document where there is a field with the value
"some_wonderland_example" but it doesn't."

It shouldn't and doesn't. :)

'some_wonderland_example' is a single token when tokenized by the
default StandardAnalyzer. If instead you specify "analyzer":"simple",
you will find that it is 3 tokens, and your search should work.

B.
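
To illustrate the difference described above, here is a rough sketch in plain JavaScript. It does not call Lucene. simpleTokens approximates SimpleAnalyzer, which splits at non-letters and lowercases. underscoreKeepingTokens is a hypothetical stand-in for the single-token behaviour reported in this thread and is not a faithful StandardAnalyzer implementation. Both function names are made up for this example.

```javascript
// Approximation of Lucene's SimpleAnalyzer: break the text at every
// non-letter character and lowercase each resulting token.
function simpleTokens(text) {
  return (text.match(/\p{L}+/gu) || []).map((t) => t.toLowerCase());
}

// Hypothetical stand-in for the behaviour described in the thread, where an
// underscore does not end a token. NOT a real StandardAnalyzer.
function underscoreKeepingTokens(text) {
  return (text.match(/[\p{L}\p{N}_]+/gu) || []).map((t) => t.toLowerCase());
}

const value = 'some_wonderland_example';
const simple = simpleTokens(value);
const keeping = underscoreKeepingTokens(value);

console.log(simple);   // three tokens: some, wonderland, example
console.log(keeping);  // one token: some_wonderland_example

// A term query matches whole tokens, so "wonderland" is only found
// when the value has been split at the underscores.
console.log(simple.includes('wonderland'));   // true
console.log(keeping.includes('wonderland'));  // false
```

This is why the full term matched while "wonderland" on its own did not: the query term has to equal an indexed token exactly.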
