Posted to solr-user@lucene.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/07/05 21:59:33 UTC

Can I invert the inverted index?

Hello,

With an inverted index the term is the key, and the documents are the
values. Is it nevertheless possible, given a document id, to get the
terms indexed for that document?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: Can I invert the inverted index?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
From your patch I see TermFreqVector, which provides the information I want.

I also found FieldInvertState.getLength(), which seems to be exactly what I
want. I'm after the word count (the sum of tf over every term in the doc). I'm
just not sure whether FieldInvertState.getLength() returns the word count or
only the number of distinct terms (not weighted by each term's frequency). It
seems to return the word count, but I've not tested it sufficiently.
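
The difference between the two candidate meanings can be shown with a small
sketch. It does not call Lucene and makes no claim about what
FieldInvertState.getLength() actually returns; the token list is a made-up
stand-in for the analyzed tokens of one field.

```python
from collections import Counter

# Hypothetical analyzed tokens of a single field (not taken from a real index).
tokens = ["cat", "hat", "cat", "back"]

# Term frequency vector: term -> number of occurrences in this document.
tf = Counter(tokens)

# Word count: the sum of tf over every term, so repeated terms count each time.
word_count = sum(tf.values())

# Number of distinct terms: each term counts once, whatever its frequency.
distinct_terms = len(tf)

print(word_count, distinct_terms)
```

The two numbers only agree when no term repeats, so a test document with a
repeated word is enough to tell which one a given API reports.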

On Wed, Jul 6, 2011 at 1:39 AM, Trey Grainger <th...@gmail.com> wrote:

> Gabriele,
>
> I created a patch that does this about a year ago.  See
> https://issues.apache.org/jira/browse/SOLR-1837.  It was written for Solr
> 1.4 and is based upon the Document Reconstructor in Luke.  The patch adds a
> link to the main solr admin page to a docinspector page which will
> reconstruct the document given a uniqueid (required).  Keep in mind that
> you're only looking at what's "in" the index for non-stored fields, not the
> original text.
>
> If you have any issues using this on the most recent release, let me know
> and I'd be happy to create a new patch for Solr 3.3.  One of these days
> I'll remove the JSP dependency and this may eventually make it into trunk.
>
> Thanks,
>
> -Trey Grainger
> Search Technology Development Team Lead, Careerbuilder.com
> Site Architect, Celiaccess.com



Re: Can I invert the inverted index?

Posted by Trey Grainger <th...@gmail.com>.
Gabriele,

I created a patch that does this about a year ago.  See
https://issues.apache.org/jira/browse/SOLR-1837.  It was written for Solr
1.4 and is based upon the Document Reconstructor in Luke.  The patch adds a
link to the main solr admin page to a docinspector page which will
reconstruct the document given a uniqueid (required).  Keep in mind that
you're only looking at what's "in" the index for non-stored fields, not the
original text.

If you have any issues using this on the most recent release, let me know
and I'd be happy to create a new patch for Solr 3.3.  One of these days I'll
remove the JSP dependency and this may eventually make it into trunk.

Thanks,

-Trey Grainger
Search Technology Development Team Lead, Careerbuilder.com
Site Architect, Celiaccess.com



Re: Can I invert the inverted index?

Posted by Rob Casson <ro...@gmail.com>.
sounds like the Luke request handler will get what you're after:

     http://wiki.apache.org/solr/LukeRequestHandler
     http://wiki.apache.org/solr/LukeRequestHandler#id

cheers,
rob
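
As a rough sketch of what such a request might look like, the snippet below
builds a Luke request handler URL that looks a document up by its uniqueKey
value through the `id` parameter. The host, port, core path and the key value
"SOLR1000" are assumptions for illustration; adjust them to your installation.

```python
from urllib.parse import urlencode

def luke_doc_url(base_url, unique_key_value, response_format="json"):
    """Build a Luke request handler URL for one document.

    base_url and unique_key_value are placeholders chosen for this example.
    """
    params = {"id": unique_key_value, "wt": response_format}
    return f"{base_url.rstrip('/')}/admin/luke?{urlencode(params)}"

# Assumed local Solr instance and document key.
url = luke_doc_url("http://localhost:8983/solr", "SOLR1000")
print(url)
```

Fetching that URL (with a browser or any HTTP client) should return the
per-field information Luke has for the document, subject to how the fields
were indexed.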


Re: Can I invert the inverted index?

Posted by Erick Erickson <er...@gmail.com>.
You can do this, kind of, but it's a lossy process. Consider indexing
"the cat in the hat strikes back", with "the" and "in" being stopwords and
"strikes" getting stemmed to "strike". At the very best, you can reconstruct
that the original doc contained "cat", "hat", "strike", "back". Is
that sufficient?
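
A toy illustration of that loss, assuming a whitespace tokenizer, the two
stopwords from the example, and a one-entry lookup table standing in for a
real stemmer (none of this is Solr's actual analysis chain):

```python
# Assumptions for illustration only: a tiny stopword set and a lookup-table "stemmer".
STOPWORDS = {"the", "in"}
STEMS = {"strikes": "strike"}

def analyze(text):
    """Lowercase, split on whitespace, drop stopwords, then apply the toy stemmer."""
    tokens = text.lower().split()
    return [STEMS.get(token, token) for token in tokens if token not in STOPWORDS]

indexed_terms = analyze("the cat in the hat strikes back")
print(indexed_terms)

# A different original text produces the same indexed terms, so the original
# wording cannot be recovered from the index alone.
same_terms = analyze("cat hat strike back")
print(same_terms == indexed_terms)
```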

And it's a very expensive process.

What is the problem you're trying to solve? Perhaps there are other ways
to get what you need.

Best
Erick

On Tue, Jul 5, 2011 at 4:22 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:
> I had looked at term vectors but don't understand how they solve my problem.
> Consider the following index entries:
>
> <t0, doc0, doc1>
> <t1, doc0>
>
> From the 2nd entry we know that t1 is only present in doc0.
> Now, my problem: given doc0, how can I know which terms occur in it (t0 and
> t1) (without storing the content)?
> One way is to go over all terms in the index using the term dictionary.

Re: Can I invert the inverted index?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
I had looked at term vectors but don't understand how they solve my problem.
Consider the following index entries:

<t0, doc0, doc1>
<t1, doc0>

From the 2nd entry we know that t1 is only present in doc0.
Now, my problem: given doc0, how can I know which terms occur in it (t0 and
t1) (without storing the content)?
One way is to go over all terms in the index using the term dictionary.
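
The term-dictionary scan described above can be sketched with a plain
dictionary standing in for the index. This is a minimal illustration of the
idea, not Lucene code; the names `inverted_index` and `invert` are made up
for the example.

```python
from collections import defaultdict

# Toy inverted index mirroring the entries above: term -> documents containing it.
inverted_index = {
    "t0": ["doc0", "doc1"],
    "t1": ["doc0"],
}

def invert(index):
    """Build doc -> terms by visiting every term and every posting once."""
    forward = defaultdict(list)
    for term in sorted(index):          # walk the whole term dictionary
        for doc_id in index[term]:      # walk that term's posting list
            forward[doc_id].append(term)
    return dict(forward)

forward_index = invert(inverted_index)
print(forward_index["doc0"])
print(forward_index["doc1"])
```

Note that the work is proportional to the total number of postings in the
index, even when only one document is of interest, which is why this approach
gets expensive on a large index.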


On Tue, Jul 5, 2011 at 10:14 PM, lboutros <bo...@gmail.com> wrote:

> Hi Gabriele,
>
> I'm not sure I understand your problem, but the TermVectorComponent may fit
> your needs?
>
> http://wiki.apache.org/solr/TermVectorComponent
> http://wiki.apache.org/solr/TermVectorComponentExampleEnabled
>
> Ludovic.



Re: Can I invert the inverted index?

Posted by lboutros <bo...@gmail.com>.
Hi Gabriele,

I'm not sure I understand your problem, but the TermVectorComponent may fit
your needs?

http://wiki.apache.org/solr/TermVectorComponent
http://wiki.apache.org/solr/TermVectorComponentExampleEnabled

Ludovic.
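
For orientation, here is a hedged sketch of a request that asks the
TermVectorComponent for the term frequencies of one document. It assumes the
component is registered under a handler named "tvrh" (as in the Solr example
configuration), that the uniqueKey field is called "id", and that the fields
of interest were indexed with term vectors enabled; the host and key value
are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def term_vector_url(base_url, unique_key_value):
    """Build a select URL that requests term vectors with term frequencies."""
    params = {
        "q": f"id:{unique_key_value}",  # assumes the uniqueKey field is named "id"
        "qt": "tvrh",                   # assumed handler name from the example config
        "tv": "true",                   # switch the component on
        "tv.tf": "true",                # include per-term frequencies
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"

url = term_vector_url("http://localhost:8983/solr", "SOLR1000")
query = parse_qs(urlsplit(url).query)   # decoded parameters, for inspection
print(url)
print(query)
```

Summing the returned tf values for a field would give the word count of that
field as it exists in the index, that is, after stopword removal and any
other analysis.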

-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/Can-I-invert-the-inverted-index-tp3142206p3142269.html
Sent from the Solr - User mailing list archive at Nabble.com.