Posted to solr-user@lucene.apache.org by Alexandre Rafalovitch <ar...@gmail.com> on 2014/04/16 17:50:24 UTC

Can I reconstruct text from tokens?

Hello,

If I use a very basic tokenizer, e.g. whitespace-based with no filters,
can I reconstruct the text from the tokenized form?

So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?

I know we store enough information, but I don't know the internal API
well enough to know what I should be looking at for a reconstruction
algorithm.

Any hints?

The XY problem is that I want to store a large amount of very
repetitive text in Solr. I want the index to be as small as possible,
so I thought that if I just pre-tokenized, my dictionary would be quite
small. And I will be reconstructing some final form anyway.
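
To make the pre-tokenized idea concrete, here is a minimal Python sketch
of dictionary encoding across documents. It is only an illustration of
the concept: the function names are made up, and nothing here is a Solr
or Lucene API.

```python
# Hypothetical sketch: dictionary-encode space-separated tokens.
# None of these names are Solr or Lucene APIs.

def build_dictionary(docs):
    """Assign each distinct token a small integer id shared by all docs."""
    term_ids = {}
    for text in docs:
        for token in text.split(" "):
            term_ids.setdefault(token, len(term_ids))
    return term_ids


def encode(text, term_ids):
    """Replace each token with its id; token order is preserved."""
    return [term_ids[token] for token in text.split(" ")]


def decode(ids, term_ids):
    """Rebuild the text by looking the ids up in the reversed dictionary."""
    terms = {i: t for t, i in term_ids.items()}
    return " ".join(terms[i] for i in ids)


docs = ["This is a test", "This is another test", "a test This is"]
term_ids = build_dictionary(docs)
encoded = [encode(d, term_ids) for d in docs]
decoded = [decode(ids, term_ids) for ids in encoded]
```

Splitting on a single space and joining with a single space round-trips
exactly, which is why this toy version is lossless. Any tokenizer that
collapses whitespace or drops characters would need extra information
(offsets, for example) to get the original text back.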

The other option is to just use compression on a stored field, but
I assume that does not take cross-document redundancy into account.
And it will be a read-only index after the build, so I don't care about
updates messing things up.
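
The cross-document concern can be illustrated outside Solr with
Python's zlib. This is a sketch on made-up data, not a model of how
Solr compresses stored fields (that depends on the version and codec).
It compares compressing each document on its own with compressing each
document against a shared preset dictionary.

```python
import zlib

# Made-up, highly repetitive "documents" for illustration only.
docs = [
    f"status report for node {i}: all systems nominal, no errors found".encode()
    for i in range(200)
]

# Case 1: every document is compressed in isolation, so repetition
# between documents cannot be exploited.
per_doc = sum(len(zlib.compress(d)) for d in docs)

# Case 2: documents are still compressed one by one, but the compressor
# is primed with a preset dictionary holding the text they share.
shared = b"status report for node : all systems nominal, no errors found"


def compress_with_dict(data, zdict):
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


blobs = [compress_with_dict(d, shared) for d in docs]
with_dict = sum(len(b) for b in blobs)
restored = [decompress_with_dict(b, shared) for b in blobs]

print(per_doc, with_dict)
```

On data like this the second total should come out noticeably smaller,
which is the kind of saving that purely per-document compression would
leave on the table.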

Regards,
   Alex

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

Re: Can I reconstruct text from tokens?

Posted by Erick Erickson <er...@gmail.com>.
Luke actually does this, or attempts to. The doc you assemble is lossy,
though:

Stop words are missing.
All capitalization is lost.
All punctuation is lost.
Original terms for synonyms are lost.
Original forms of stemmed words are lost.
Anything you do with, say, ngrams will definitely be strange.
etc.

It's also slow, and I don't think you can do this at all unless you
store term information.

Basically, any filter in the analysis chain may change what goes into
the index; that's its job. Each step may lose information.
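
A toy illustration of that point, in Python rather than with real Solr
analyzers. The filters and the stop list below are assumptions made up
for the example; they only mimic what lowercasing, punctuation
stripping, and stop-word removal do to a token stream.

```python
STOP_WORDS = {"is", "a", "the"}  # assumed toy stop list


def whitespace_tokenize(text):
    return text.split()


def strip_punctuation_filter(tokens):
    stripped = [t.strip(".,!?;:") for t in tokens]
    return [t for t in stripped if t]


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, filters):
    """Tokenize, then run the tokens through each filter in order."""
    tokens = whitespace_tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


original = "This is a Test, really!"

# No filters: the tokens still carry everything except the exact whitespace.
bare = analyze(original, [])

# A typical-looking chain: each step throws something away.
full = analyze(original, [strip_punctuation_filter, lowercase_filter, stop_filter])

print(" ".join(bare))
print(" ".join(full))
```

With no filters, joining the tokens gives back the original sentence.
After the full chain, the punctuation, the capitalization, and the stop
words are gone, and nothing in the token list says where they were.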

FWIW,
Erick


On Fri, Apr 18, 2014 at 12:36 PM, Ramkumar R. Aiyengar
<an...@gmail.com> wrote:
> Sorry, I didn't think this through. You're right, it's still the same
> problem.

Re: Can I reconstruct text from tokens?

Posted by "Ramkumar R. Aiyengar" <an...@gmail.com>.
Sorry, I didn't think this through. You're right, it's still the same
problem.
On 16 Apr 2014 17:40, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:

> Why? I want stored=false, at which point a multivalued field is just
> offset values into the dictionary. I'd still have to reconstruct from
> the offsets.
>
> Or am I missing something?
>
> Regards,
>      Alex

Re: Can I reconstruct text from tokens?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Why? I want stored=false, at which point a multivalued field is just
offset values into the dictionary. I'd still have to reconstruct from
the offsets.

Or am I missing something?

Regards,
     Alex
On 16/04/2014 10:59 pm, "Ramkumar R. Aiyengar" <an...@gmail.com>
wrote:

> Logically, if you tokenize and put the results in a multivalued field,
> you should be able to get all the values back in sequence?

Re: Can I reconstruct text from tokens?

Posted by "Ramkumar R. Aiyengar" <an...@gmail.com>.
Logically, if you tokenize and put the results in a multivalued field,
you should be able to get all the values back in sequence?

Re: Can I reconstruct text from tokens?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I believe you could use term vectors to retrieve all the terms in a
document, with their offsets. Retrieving them from the inverted index
would be expensive, since the index is term-oriented, not
document-oriented. Without term vectors, I think you essentially have
to scan the entire term dictionary looking for the terms in your
document, so that will probably cost you more than it's worth.
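
The difference between the two access paths can be sketched with plain
Python dictionaries. This is a conceptual model only, not the Lucene
API, and it uses token positions where a real term vector could also
carry character offsets.

```python
from collections import defaultdict

docs = {0: "This is a test", 1: "This is another test"}

# Inverted index: term -> {doc_id: [positions]}  (term-oriented)
inverted = defaultdict(dict)
# Term vectors: doc_id -> {term: [positions]}    (document-oriented)
term_vectors = {}

for doc_id, text in docs.items():
    vector = defaultdict(list)
    for pos, term in enumerate(text.split()):
        vector[term].append(pos)
        inverted[term].setdefault(doc_id, []).append(pos)
    term_vectors[doc_id] = dict(vector)


def rebuild(term_positions):
    """Turn (term, positions) pairs back into text ordered by position."""
    slots = {}
    for term, positions in term_positions:
        for pos in positions:
            slots[pos] = term
    return " ".join(slots[p] for p in sorted(slots))


def from_term_vector(doc_id):
    # One lookup; touches only the terms of this document.
    return rebuild(term_vectors[doc_id].items())


def from_inverted_index(doc_id):
    # Has to visit every term in the dictionary to see whether it
    # occurs in this document.
    hits = ((term, postings[doc_id])
            for term, postings in inverted.items() if doc_id in postings)
    return rebuild(hits)


print(from_term_vector(1))
print(from_inverted_index(1))
```

Both functions return the same text, but the second one walks the
whole term dictionary for every document it rebuilds, which is the
cost described above.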

-Mike
