Posted to java-user@lucene.apache.org by Phanindra R <ph...@gmail.com> on 2012/07/26 18:56:51 UTC

Getting terms from unstored fields, doc-wise

Hi,
     I have an index that I need to analyze manually. Unfortunately, I
cannot rebuild the index. Some of the fields are 'unstored'. I was wondering
whether there is any way to get the terms from an unstored field for each doc.
Positional information is not necessary. The Lucene version is 3.5.

The reason I am trying to get those terms is so that I can add that field to
my own index for every doc. And, yes, there is another ID-type field that
lets me identify the same document in both indices.

Any guidance is highly appreciated.

Thanks,
Phani

Re: Getting terms from unstored fields, doc-wise

Posted by Phanindra R <ph...@gmail.com>.
Thanks a lot, Aditya and Andrzej. Your responses were really helpful.


Re: Getting terms from unstored fields, doc-wise

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 26/07/2012 22:04, Phanindra R wrote:
> Thanks for the reply Abdul.
>
> I was exploring the API and I think we can retrieve all those words by
> using a brute-force approach.
>
> 1) Get all the terms using indexReader.terms()
>
> 2) Process the term only if it belongs to the target field.
>
> 3) Get all the docs using indexReader.termDocs(term);
>
> 4) So, we have the term-doc pairs at this point.

This procedure is implemented in Luke (http://code.google.com/p/luke) in
the "Reconstruct & Edit" function. On larger indexes it is indeed a
time-consuming procedure.

>
> Is there any better approach other than the above forever-taking procedure?

No. Indexing is usually a lossy process - some data is irretrievably 
lost - and the resulting data structure is not optimized for 
re-assembling the original content. If you need to retrieve the original 
content you have to store it, either using stored fields or in an 
external system.
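
To make the "lossy" point concrete, here is a minimal sketch in plain Java. It
does not use Lucene; the class name, the stop-word list, and the toy
lowercase-and-split analysis are all assumptions made for illustration, and a
real analyzer may behave differently. It shows two different texts producing
the same set of terms, so the original text cannot be told apart from the
terms alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of analysis at index time; not Lucene code. */
class LossyIndexingSketch {

    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is"));

    /** Lowercases, splits on non-alphanumerics, drops stop words and duplicates. */
    static SortedSet<String> indexedTerms(String text) {
        SortedSet<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String original = "The Index of Lucene is the index.";
        String different = "Lucene INDEX!";
        // Case, punctuation, stop words, word order and repeats are all gone.
        System.out.println(indexedTerms(original));   // prints [index, lucene]
        // A different text yields exactly the same terms.
        System.out.println(indexedTerms(original).equals(indexedTerms(different))); // prints true
    }
}
```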


-- 
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
  ___.,___,___,___,_._. __________________<><____________________
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com


Re: Getting terms from unstored fields, doc-wise

Posted by Aditya <fi...@gmail.com>.
Hi

If the data is not stored, it cannot be retrieved in its original form.
Using IndexReader as you described, you can retrieve the list of terms
present in each doc. Those terms may have been analyzed, though, so you may
not get the exact original data.

Regards
Aditya
www.findbestopensource.com

Re: Getting terms from unstored fields, doc-wise

Posted by Phanindra R <ph...@gmail.com>.
Thanks for the reply, Abdul.

I was exploring the API and I think we can retrieve all those words by
using a brute-force approach.

1) Get all the terms using indexReader.terms()

2) Process the term only if it belongs to the target field.

3) Get all the docs using indexReader.termDocs(term);

4) So, we have the term-doc pairs at this point.
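
The four steps can be sketched as follows. This is a plain-Java model only, not
Lucene code: the Posting class and the termsByDoc method are hypothetical
stand-ins for what indexReader.terms() and indexReader.termDocs(term) would
hand back, so that the regrouping from term-doc pairs into per-document term
lists is easy to see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of "un-inverting" one field; no Lucene classes are used. */
class UninvertSketch {

    /** Hypothetical posting list: one (field, text) term plus the docs containing it. */
    static final class Posting {
        final String field;
        final String text;
        final int[] docIds;

        Posting(String field, String text, int... docIds) {
            this.field = field;
            this.text = text;
            this.docIds = docIds;
        }
    }

    /** Walks every term once and regroups the term-doc pairs by document id. */
    static Map<Integer, SortedSet<String>> termsByDoc(List<Posting> postings, String targetField) {
        Map<Integer, SortedSet<String>> byDoc = new TreeMap<>();
        for (Posting p : postings) {                    // step 1: all terms
            if (!p.field.equals(targetField)) {         // step 2: target field only
                continue;
            }
            for (int docId : p.docIds) {                // step 3: all docs for the term
                byDoc.computeIfAbsent(docId, k -> new TreeSet<>())
                     .add(p.text);                      // step 4: collect the pair
            }
        }
        return byDoc;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        postings.add(new Posting("body", "index", 0, 2));
        postings.add(new Posting("body", "lucene", 0, 1));
        postings.add(new Posting("title", "hello", 1));
        System.out.println(termsByDoc(postings, "body"));
        // prints {0=[index, lucene], 1=[lucene], 2=[index]}
    }
}
```

The loop visits every posting of the field exactly once, which is why the
running time grows with the size of the index.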

Is there a better approach than this brute-force procedure, which takes forever?

Thanks,
Phanindra



Re: Getting terms from unstored fields, doc-wise

Posted by "in.abdul" <in...@gmail.com>.
No, it's not possible to get data that was not stored.

-----
THANKS AND REGARDS,
SYED ABDUL KATHER