Posted to java-user@lucene.apache.org by ca...@dazoot.ro on 2005/06/19 11:17:13 UTC

md5 keyword field issue

Hi there,

i have an index with the following infos in it:
url - keyword - Field("url", this.url, Field.Store.YES, Field.Index.UN_TOKENIZED);
md5 - keyword - Field("md5", this.url, Field.Store.YES, Field.Index.UN_TOKENIZED);
alt - Field("alt", this.alt, Field.Store.YES, Field.Index.TOKENIZED);

i use it to index my images.
now it happens that the same image (eg: same md5) is used in different
locations (eg: different urls).
filename: mylogo.gif used in
http://site.com/project1/mylogo.gif and also
http://site.com/project2/some_other_bubu/mylogo.gif

the ALT is different (eg: different text)

now on my image search app when i search mylogo i get "several"
results with the same image.

i would like to reduce the nr of results in that way that the md5 is
unique.
Note: i can't delete from the index the 2nd image cause the ALT might
be different, so in general all the properties put together (md5, url,
alt) compose a different "entity".


i bought "Lucene in Action" book, which is a GREAT book.
i was looking into "filters".

i quote: "If all the information needed to perform filtering is in the
index, there is no need to write your own filter because QueryFilter
can handle it."

i can't seem to figure it out, how query filter can help me.

also tried to write my own filter but not that much info on that
direction either.


any info, links, thoughts, would be highly appreciated !

-- 
Catalin Constantin
http://www.dazoot.ro/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Re[4]: md5 keyword field issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 20, 2005, at 10:54 AM, catalin-lucene@dazoot.ro wrote:
> Monday, June 20, 2005, 5:48:30 PM, Erik Hatcher wrote:
>
>> Now you've just said the same conflicting thing a different way.  You
>> want to cluster but only return one.  :)
>>
>
> i think i misunderstood the term "cluster" here.
> so yes, i just want one image returned.

Maybe my interpretation of "cluster" is clouded by the search  
domain.  In the search domain, cluster means grouping multiple things.

>> If you only want one image returned, then it seems that only indexing
>> the same image once is the way to go.  When you find a duplicate MD5,
>> don't index that as a second document.  You will, instead, update the
>> document by adding additional ALT text and perhaps the additional  
>> URL.
>>
>
> this sounds pretty ok !

The tricks are to do a search when indexing to find duplicates, and
to "update" the document by deleting and re-adding it (you'll
probably want to store the field data so you can retrieve it easily
and use it for the new updated document).

The negative to this approach is you won't know specifically which
page the image was on in results, though you could keep all URLs
that point to it, as a document can have multiple fields named "URL",
for example.
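
The index-time merge described above can be sketched in plain Java.
This is only an illustration under stated assumptions: a LinkedHashMap
stands in for the Lucene index, the ALT strings are made up, and the
names ImageIndexSketch, ImageDoc, and add are hypothetical rather than
Lucene API. In a real index the merge step would be a lookup by md5
followed by deleting the old document and re-adding one built from its
stored fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the index; not Lucene API.
class ImageIndexSketch {

    // One "document" per unique image (keyed by MD5).
    static final class ImageDoc {
        final String md5;
        final Set<String> urls = new LinkedHashSet<>(); // plays the role of repeated "url" fields
        final List<String> alts = new ArrayList<>();    // ALT text collected from every occurrence

        ImageDoc(String md5) {
            this.md5 = md5;
        }
    }

    private final Map<String, ImageDoc> docsByMd5 = new LinkedHashMap<>();

    // Look up the MD5 first; create a document only if none exists, then merge URL and ALT.
    void add(String md5, String url, String alt) {
        ImageDoc doc = docsByMd5.get(md5);
        if (doc == null) {
            doc = new ImageDoc(md5);
            docsByMd5.put(md5, doc);
        }
        doc.urls.add(url);
        if (!doc.alts.contains(alt)) {
            doc.alts.add(alt);
        }
    }

    ImageDoc get(String md5) {
        return docsByMd5.get(md5);
    }

    int size() {
        return docsByMd5.size();
    }

    public static void main(String[] args) {
        ImageIndexSketch index = new ImageIndexSketch();
        String md5 = "e127d0e91af5d8b2522138fb46c2e1bc";
        // ALT values below are illustrative only.
        index.add(md5, "http://site.com/project1/mylogo.gif", "project one logo");
        index.add(md5, "http://site.com/project2/some_other_bubu/mylogo.gif", "project two logo");
        System.out.println(index.size() + " document(s), URLs: " + index.get(md5).urls);
    }
}
```

Keeping every URL on the merged document preserves the "which page
was it on" information at the cost of no longer knowing which ALT text
belonged to which URL.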

>>> in sql this would be:
>>> select distinct md5, url, alt from table group by md5 order by
>>> score asc;
>>>
>
>
>> This would give you multiple records for the same MD5.  You said
>> above you only want one per MD5.
>>
>
> here i'm afraid you are not correct, because i have GROUP BY MD5
> clause which will return no duplicates.

Sorry, I missed the GROUP BY clause there in my first human parse of  
the expression - I was too busy focusing on DISTINCT.

     Erik




Re[4]: md5 keyword field issue

Posted by ca...@dazoot.ro.
Monday, June 20, 2005, 5:48:30 PM, Erik Hatcher wrote:
> Now you've just said the same conflicting thing a different way.  You
> want to cluster but only return one.  :)

i think i misunderstood the term "cluster" here.
so yes, i just want one image returned.

> If you only want one image returned, then it seems that only indexing
> the same image once is the way to go.  When you find a duplicate MD5,
> don't index that as a second document.  You will, instead, update the
> document by adding additional ALT text and perhaps the additional URL.

this sounds pretty ok !

> Is there a reason why indexing each unique image (by MD5) is not a  
> good way to go in your case?

>> in sql this would be:
>> select distinct md5, url, alt from table group by md5 order by  
>> score asc;

> This would give you multiple records for the same MD5.  You said  
> above you only want one per MD5.

here i'm afraid you are not correct, because i have GROUP BY MD5
clause which will return no duplicates.

(tested it on mysql)
for the query above.
170 rows in set (0.13 sec)

select distinct md5 from image;
| e127d0e91af5d8b2522138fb46c2e1bc |
| 7a18b029925d8357599878a85fd6b02f |
+----------------------------------+
170 rows in set (0.00 sec)

same nr of rows :D




-- 
Catalin Constantin





Re: Re[2]: md5 keyword field issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 20, 2005, at 9:38 AM, catalin-lucene@dazoot.ro wrote:

> Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:
>
>> Filters reduce the search space to a subset of the documents in the
>> index.  Which document would you want returned when there are
>> multiple documents in the index with the same MD5?  Or do you want to
>> cluster them by MD5?
>>
>
> i think cluster by md5 is more appropriate.
>
>
>> Do you want to cluster them by MD5 perhaps, but still return multiple
>> documents back from a search?
>>
>
> i want to return just the 1st image (the more relevant one). no use to
> show duplicates in an image search app.

Now you've just said the same conflicting thing a different way.  You  
want to cluster but only return one.  :)

If you only want one image returned, then it seems that only indexing  
the same image once is the way to go.  When you find a duplicate MD5,  
don't index that as a second document.  You will, instead, update the  
document by adding additional ALT text and perhaps the additional URL.

Is there a reason why indexing each unique image (by MD5) is not a  
good way to go in your case?

> in sql this would be:
> select distinct md5, url, alt from table group by md5 order by  
> score asc;

This would give you multiple records for the same MD5.  You said  
above you only want one per MD5.

     Erik




Re[2]: md5 keyword field issue

Posted by ca...@dazoot.ro.
Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:
> Filters reduce the search space to a subset of the documents in the
> index.  Which document would you want returned when there are  
> multiple documents in the index with the same MD5?  Or do you want to
> cluster them by MD5?

i think cluster by md5 is more appropriate.

> Do you want to cluster them by MD5 perhaps, but still return multiple
> documents back from a search?

i want to return just the 1st image (the more relevant one). no use to
show duplicates in an image search app.

> I'm not sure if a Filter is the appropriate technique for this  
> scenario or not.

well, i am not sure either.
one solution would be to group the hits by md5 (or something similar)
when i iterate through the hits collection and send them to the webapp.

is this a good way to do it ?
(the bad thing is i would have to do lots of hits.doc(index) in
advance, to make this group by md5 thing, and if the results are
paginated << which is the case >>, on the 2nd page i would need to
keep in session the last "index" or to recalculate it again.. - oh
nein !:)

in sql this would be:
select distinct md5, url, alt from table group by md5 order by score asc;

if i had the score in the DB (which is not the case).
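
A minimal sketch of the iterate-and-group idea above, assuming the
hits arrive already sorted by relevance. Hit and UniqueHitPager are
hypothetical names; with Lucene the md5 would come from the stored
field of each hit. Rather than keeping the last index in the session,
this version simply rescans from the top for every page, which means
more document loads on deep pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; Hit is a plain stand-in for a search result, not a Lucene class.
class UniqueHitPager {

    static final class Hit {
        final String md5;
        final String url;

        Hit(String md5, String url) {
            this.md5 = md5;
            this.url = url;
        }
    }

    // Returns page `page` (0-based) of the hits after keeping only the first,
    // i.e. best-ranked, hit for each MD5. Rescans from the top on every call.
    static List<Hit> uniquePage(List<Hit> rankedHits, int page, int pageSize) {
        List<Hit> result = new ArrayList<>();
        if (pageSize <= 0 || page < 0) {
            return result;
        }
        Set<String> seenMd5 = new HashSet<>();
        int toSkip = page * pageSize; // unique hits that belong to earlier pages
        for (Hit hit : rankedHits) {
            if (!seenMd5.add(hit.md5)) {
                continue; // duplicate image; a better-ranked copy was already seen
            }
            if (toSkip > 0) {
                toSkip--;
                continue;
            }
            result.add(hit);
            if (result.size() == pageSize) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> ranked = Arrays.asList(
                new Hit("aaa", "http://site.com/project1/mylogo.gif"),
                new Hit("aaa", "http://site.com/project2/some_other_bubu/mylogo.gif"),
                new Hit("bbb", "http://site.com/other.gif"));
        for (Hit hit : uniquePage(ranked, 0, 10)) {
            System.out.println(hit.md5 + " " + hit.url);
        }
    }
}
```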

-- 
Catalin Constantin




Re: md5 keyword field issue

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 19, 2005, at 5:17 AM, catalin-lucene@dazoot.ro wrote:

> Hi there,
>
> i have an index with the following infos in it:
> url - keyword - Field("url", this.url, Field.Store.YES,  
> Field.Index.UN_TOKENIZED);
> md5 - keyword - Field("md5", this.url, Field.Store.YES,  
> Field.Index.UN_TOKENIZED);
> alt - Field("alt", this.alt, Field.Store.YES, Field.Index.TOKENIZED);
>
> i use it to index my images.
> now it happens that the same image (eg: same md5) is used in different
> locations (eg: different urls).
> filename: mylogo.gif used in
> http://site.com/project1/mylogo.gif and also
> http://site.com/project2/some_other_bubu/mylogo.gif
>
> the ALT is different (eg: different text)
>
> now on my image search app when i search mylogo i get "several"
> results with the same image.
>
> i would like to reduce the nr of results in that way that the md5 is
> unique.
> Note: i can't delete from the index the 2nd image cause the ALT might
> be different, so in general all the properties put together (md5, url,
> alt) compose a different "entity".


It seems you have conflicting goals here.  You want (md5, url, alt)  
to be unique in one sense, yet you want md5 itself to be unique in  
another sense.

> i bought "Lucene in Action" book, which is a GREAT book.

Thank you!  :)

> i was looking into "filters".
>
> i quote: "If all the information needed to perform filtering is in the
> index, there is no need to write your own filter because QueryFilter
> can handle it."
>
> i can't seem to figure it out, how query filter can help me.
>
> also tried to write my own filter but not that much info on that
> direction either.

Filters reduce the search space to a subset of the documents in the  
index.  Which document would you want returned when there are  
multiple documents in the index with the same MD5?  Or do you want to  
cluster them by MD5?

Do you want to cluster them by MD5 perhaps, but still return multiple  
documents back from a search?

I'm not sure if a Filter is the appropriate technique for this  
scenario or not.
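
To make the concern concrete: in the Lucene releases of that period a
Filter produced a java.util.BitSet with one bit per document number,
computed without looking at the query. The sketch below uses only
java.util.BitSet and no Lucene classes; Md5FilterSketch, firstPerMd5,
and apply are hypothetical names. It suggests why a "first document
per md5" filter can hide the only duplicate that matches a given query.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of a query-independent filter; not Lucene API.
class Md5FilterSketch {

    // Sets the bit for the first document number seen with each MD5.
    static BitSet firstPerMd5(String[] md5ByDoc) {
        BitSet bits = new BitSet(md5ByDoc.length);
        Set<String> seen = new HashSet<>();
        for (int doc = 0; doc < md5ByDoc.length; doc++) {
            if (seen.add(md5ByDoc[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Intersects the query's matches with the filter, as a filtered search would.
    static BitSet apply(BitSet queryMatches, BitSet filter) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        // Documents 0 and 1 are the same image with different ALT text.
        String[] md5ByDoc = {"aaa", "aaa", "bbb"};
        BitSet filter = firstPerMd5(md5ByDoc); // keeps documents 0 and 2

        // Suppose a query matches only document 1 (its ALT text differs from document 0).
        BitSet queryMatches = new BitSet();
        queryMatches.set(1);

        // The filter already discarded document 1, so nothing survives.
        System.out.println("surviving matches: " + apply(queryMatches, filter).cardinality());
    }
}
```

Because the choice of which duplicate to keep is made before the query
runs, the image can vanish from the results even though one of its
copies matched.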

     Erik

