Posted to solr-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2009/08/01 08:32:41 UTC

Re: dealing with duplicates

Joe,

Maybe we can take a step back first.  Would it be better if your index was cleaner and didn't have flagged duplicates in the first place?  If so, have you tried using http://wiki.apache.org/solr/Deduplication ?

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Joe Calderon <ca...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 5:06:48 PM
> Subject: dealing with duplicates
> 
> hello all, i have a collection of a few million documents; i have many
> duplicates in this collection. they have been clustered with a simple
> algorithm. i have a field called 'duplicate' which is 0 or 1 and
> fields called 'description', 'tags' and 'meta'. documents are clustered on
> different criteria and the text i search against could be very
> different among members of a cluster.
> 
> i'm currently using a dismax handler to search across the text fields
> with different boosts, and a filter query to restrict to masters
> (duplicate: 0)
> 
> my question is then, how do i best query for documents which are
> masters OR match text but are not included in the matched set of
> masters?
> 
> does this make sense?


Re: Retrieving the boost factor using Solrj CommonsHttpSolrServer

Posted by Avlesh Singh <av...@gmail.com>.
>
> The boost factor is available in the SolrInputDocument, but not in the
> SolrDocument returned by the SolrServer 'query' method
>
Yes, you are right. There seems to be an inconsistency.

> And there is no relationship between the SolrInputDocument and the
> SolrDocument (... which in itself is pretty confusing).
>
I definitely agree with you.

Actually, I always had this question for Yonik (the SolrDocument class has
your name, Yonik) - One should always use SolrInputDocument (and not
SolrDocument) for indexing, right? Why do we have setField, addField and
removeFields as public methods in the SolrDocument class?

Cheers
Avlesh

On Tue, Aug 11, 2009 at 10:29 AM, Villemos, Gert
<ge...@logica.com> wrote:

> I'm using the solrj CommonsHttpSolrServer to retrieve documents from the
> index for update. I therefore also need to retrieve the boost factor, since
> otherwise each resubmission would reset it. I just can't figure out how
> to retrieve the boost factor.
>
> The boost factor is available in the SolrInputDocument, but not in the
> SolrDocument returned by the SolrServer 'query' method. And there is no
> relationship between the SolrInputDocument and the SolrDocument (... which
> in itself is pretty confusing).
>
> How can I get the boost factor? Do I have to use 'request' method and parse
> the result myself?
>
> Cheers,
> Gert.
>
>
>
>
> Please help Logica to respect the environment by not printing this email  /
> Pour contribuer comme Logica au respect de l'environnement, merci de ne pas
> imprimer ce mail /  Bitte drucken Sie diese Nachricht nicht aus und helfen
> Sie so Logica dabei, die Umwelt zu schützen /  Por favor ajude a Logica a
> respeitar o ambiente nao imprimindo este correio electronico.
>
>
>
> This e-mail and any attachment is for authorised use by the intended
> recipient(s) only. It may contain proprietary material, confidential
> information and/or be subject to legal privilege. It should not be copied,
> disclosed to, retained or used by, any other party. If you are not an
> intended recipient then please promptly delete this e-mail and any
> attachment and all copies and inform the sender. Thank you.
>
>

Re: Retrieving the boost factor using Solrj CommonsHttpSolrServer

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Retrieving the boost factor using Solrj CommonsHttpSolrServer
: References:
:     <e3...@mail.gmail.com><957081.80086
:     .qm@web50309.mail.re2.yahoo.com><e3cd93650908010915j162baaddved542c8482d8e
:     05@mail.gmail.com><e3cd93650908101259y39c4534ekd4643baa714c4960@mail.gmail
:     .com> <f1...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


Retrieving the boost factor using Solrj CommonsHttpSolrServer

Posted by "Villemos, Gert" <ge...@logica.com>.
I'm using the solrj CommonsHttpSolrServer to retrieve documents from the index for update. I therefore also need to retrieve the boost factor, since otherwise each resubmission would reset it. I just can't figure out how to retrieve the boost factor.
 
The boost factor is available in the SolrInputDocument, but not in the SolrDocument returned by the SolrServer 'query' method. And there is no relationship between the SolrInputDocument and the SolrDocument (... which in itself is pretty confusing).
 
How can I get the boost factor? Do I have to use 'request' method and parse the result myself?
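
One possible workaround, sketched here under assumptions rather than taken from this thread: as far as I know, Solr of this era folds the index-time boost into the field norms and does not return it in query responses, so the value would have to be kept somewhere retrievable. The sketch below assumes a hypothetical stored field named `doc_boost` (not part of any schema mentioned here) that holds the boost, and rebuilds an XML update message whose `<doc>` element carries that value in its `boost` attribute. The document contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored float field that mirrors the boost at index time.
BOOST_FIELD = "doc_boost"


def build_update_xml(doc, default_boost=1.0):
    """Build a Solr XML <add> message, re-applying the boost saved in BOOST_FIELD.

    `doc` is a plain dict of field name -> value (or list of values),
    e.g. the stored fields of a document returned by a query.
    """
    boost = float(doc.get(BOOST_FIELD, default_boost))
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc", boost=str(boost))
    for name, value in doc.items():
        # Multi-valued fields become one <field> element per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Illustrative document, as it might come back from a query.
retrieved = {"id": "42", "title": "family guy", BOOST_FIELD: 2.5}
print(build_update_xml(retrieved))
```

The same idea should carry over to SolrJ by reading the stored field from the SolrDocument and setting the boost on a new SolrInputDocument, but that part is not shown or verified here.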
 
Cheers,
Gert.
 
 




Re: dealing with duplicates

Posted by Avlesh Singh <av...@gmail.com>.
Can you please provide your schema details here?

Cheers
Avlesh

On Tue, Aug 11, 2009 at 1:29 AM, Joe Calderon <ca...@gmail.com> wrote:

> so in case someone can help me with the query syntax, the
> relational query i would use for this would be something like:
>
> SELECT * FROM videos
> WHERE
> title LIKE 'family guy'
> AND desc LIKE 'stewie%'
> AND (
>  ( is_dup = 0 )
>  OR
>  ( is_dup = 1 AND id NOT IN
>    (
>    SELECT id FROM videos
>    WHERE
>    title LIKE 'family guy'
>    AND desc LIKE 'stewie%'
>    AND is_dup = 0
>    )
>  )
> )
> ORDER BY views
> LIMIT 10
>
> can a similar query be written in lucene or do i need to structure my
> index differently to be able to do such a query?
>
> thx much
>
> --joe

Re: dealing with duplicates

Posted by Joe Calderon <ca...@gmail.com>.
so in case someone can help me with the query syntax, the
relational query i would use for this would be something like:

SELECT * FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND (
  ( is_dup = 0 )
  OR
  ( is_dup = 1 AND id NOT IN
    (
    SELECT id FROM videos
    WHERE
    title LIKE 'family guy'
    AND desc LIKE 'stewie%'
    AND is_dup = 0
    )
  )
)
ORDER BY views
LIMIT 10

can a similar query be written in lucene or do i need to structure my
index differently to be able to do such a query?
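
For illustration only, here is one way the intended result could be assembled on the client side instead of in a single query. This is a sketch under assumptions, not an answer given in this thread: it assumes the text query is run without the duplicate filter, and that every duplicate carries a hypothetical `master_id` field pointing at its master (the thread mentions a master id but not the field name). Masters are kept, and a duplicate is kept only when its master is not among the matched masters.

```python
def collapse_by_cluster(matches):
    """Keep masters, plus duplicates whose master did not match.

    `matches` is a ranked list of dicts with the keys 'id',
    'duplicate' (0 or 1) and, for duplicates, 'master_id'
    (a hypothetical field name). The original ranking is preserved.
    """
    matched_masters = {d["id"] for d in matches if d["duplicate"] == 0}
    kept = []
    for d in matches:
        is_master = d["duplicate"] == 0
        if is_master or d.get("master_id") not in matched_masters:
            kept.append(d)
    return kept


# Illustrative result set: m1 matched, m2 did not.
sample = [
    {"id": "m1", "duplicate": 0},
    {"id": "d1", "duplicate": 1, "master_id": "m1"},  # hidden: its master matched
    {"id": "d2", "duplicate": 1, "master_id": "m2"},  # kept: its master did not match
]
print([d["id"] for d in collapse_by_cluster(sample)])  # ['m1', 'd2']
```

Two caveats: filtering after the fact changes the number of rows per page, so more rows than needed would have to be fetched, and several duplicates of the same unmatched master would all be kept unless they are collapsed further.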

thx much

--joe



Re: dealing with duplicates

Posted by Joe Calderon <ca...@gmail.com>.
hello, thanks for the response, i did take a look at that document but
in my application i actually want the duplicates, as i mentioned, the
matching text could be very different among cluster members, what
joins them together is a similar set of numeric features.

currently i do a query with fq=duplicate:0 and show a link to
optionally show the "dupes" by querying for all dupes of the
master id. however, i'm currently missing any documents that matched the
query but are duplicates of other masters not included in that result
set.
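
As a rough sketch of the setup described above, the request parameters for the masters-only query might be built like this. The field names `description`, `tags`, `meta` and the filter `duplicate:0` come from this thread; the boost values, the `rows` default and the example search terms are assumptions made up for illustration.

```python
from urllib.parse import urlencode


def build_master_query(user_query, rows=10):
    """Return a URL-encoded query string for a dismax search limited to masters."""
    params = [
        ("q", user_query),
        ("defType", "dismax"),
        # Boosts below are placeholders, not the values used in the original setup.
        ("qf", "description^2.0 tags^1.5 meta^1.0"),
        # Filter query from the thread: only documents flagged as masters.
        ("fq", "duplicate:0"),
        ("rows", str(rows)),
    ]
    return urlencode(params)


print(build_master_query("family guy stewie"))
```

The resulting string would be appended to the select handler URL of the Solr instance. Dropping the `fq` entry gives the unfiltered match set, which also contains the duplicates that the masters-only query misses.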

in a relational database (fulltext indexing aside) i would use a
subquery, i imagine a similar approach could be used with lucene, i
just dont know the syntax

best,

--joe
