Posted to java-user@lucene.apache.org by Jürgen Jakobitsch <ju...@semantic-web.com> on 2017/03/09 20:22:46 UTC

codec: accessing term dictionary

hi,

i'd like to ask users for their experiences with the fastest way to access
the term dictionary.

what i want to do is to implement some algorithms to find phrases (e.g.
mutual rank ratio [1]) and other statistics on term distribution
(generally: corpus-related stuff).

the idea would be to do statistics on numbers (i.e. longs from the term
dictionary) to minimize memory usage. i did try this with termsEnum + the
ordinal numbers of terms, which are easily retrievable, but getting a term
by ord then throws an UnsupportedOperationException [2]. i see there's also
a blocktreeords codec [3].

now, before diving deeper into this (i.e. changing codecs for my indexes),
i'd like to ask whether a workflow like the one described above is
considered at least semi-smart, or whether i'm on the wrong track and
there's a smarter way that avoids calculating collocations based on actual
strings or BytesRefs?
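to make the "statistics on numbers" idea concrete, here is a minimal,
dependency-free sketch (plain java, not the Lucene API; the ordinal values
are made up) of packing two 32-bit term ordinals into one long, so bigram
counts can be kept entirely as primitives:

```java
import java.util.HashMap;
import java.util.Map;

public class OrdBigrams {
    // Pack two term ordinals (assumed to fit in 32 bits) into one long key.
    static long pack(long ordA, long ordB) {
        return (ordA << 32) | (ordB & 0xFFFFFFFFL);
    }

    static long first(long key)  { return key >>> 32; }
    static long second(long key) { return key & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        Map<Long, Integer> counts = new HashMap<>();
        long[] tokenOrds = {7, 42, 7, 42, 9};  // hypothetical ordinal stream
        for (int i = 0; i + 1 < tokenOrds.length; i++) {
            counts.merge(pack(tokenOrds[i], tokenOrds[i + 1]), 1, Integer::sum);
        }
        System.out.println(counts.get(pack(7, 42)));  // 2: (7,42) occurs twice
    }
}
```

the packing obviously stops working for shingles longer than two terms or
for ordinals above 2^32; those would need arrays of longs instead.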

any pointer is really appreciated.

kind regards, jürgen

[1] http://www.google.ch/patents/US20100250238
[2]
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
[3]
https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java

*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz



PERSONAL INFORMATION
| web       : http://www.turnguard.com
| foaf      : http://www.turnguard.com/turnguard
| g+        : https://plus.google.com/111233759991616358206/posts
| skype     : jakobitsch-punkt
| xmlns:tg  = "http://www.turnguard.com/turnguard#"
| blockchain : https://onename.com/turnguard

Re: codec: accessing term dictionary

Posted by Jürgen Jakobitsch <ju...@semantic-web.com>.
michael, thanks for your input..

i already extended the default codec to return the
BlockTreeOrdsPostingsFormat for testing; this works nicely and i can
access terms via their ordinals.

speed is not really the issue (some things simply take a while... ;-) ).
i also don't want to index shingles, because i can get them via positions
anyway..

so what i'm going to do for a first test is to loop over docs/terms +
positions to accumulate shingles of size n as arrays of longs, do the math
and then retrieve the terms via those ordinals..

let's see... ;-)

kr j
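that accumulation step can be sketched without any Lucene dependency (the
ordinal stream below is invented; note that long[] has identity-based
hashCode, so shingles need a content-aware wrapper to serve as map keys):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ShingleCounts {
    // long[] uses identity hashCode/equals, so wrap shingles for map keys.
    record Shingle(long[] ords) {
        @Override public boolean equals(Object o) {
            return o instanceof Shingle s && Arrays.equals(ords, s.ords);
        }
        @Override public int hashCode() { return Arrays.hashCode(ords); }
    }

    // Count every window of n consecutive ordinals in a position-ordered stream.
    static Map<Shingle, Integer> count(long[] stream, int n) {
        Map<Shingle, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= stream.length; i++) {
            counts.merge(new Shingle(Arrays.copyOfRange(stream, i, i + n)),
                         1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] ords = {3, 5, 8, 3, 5, 8};  // hypothetical per-position ordinals
        System.out.println(count(ords, 3).get(new Shingle(new long[]{3, 5, 8})));  // 2
    }
}
```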





Re: codec: accessing term dictionary

Posted by Jürgen Jakobitsch <ju...@semantic-web.com>.
dawid, thanks for your input..

initially i was hoping to be able to use FSTs somehow in this process, but
my knowledge in this area is fairly limited..
i will give it a second thought anyway... ;-)

krj



Re: codec: accessing term dictionary

Posted by Dawid Weiss <da...@gmail.com>.
Or you could encode those term/ngram frequencies into an FST and then
reuse it. This would be memory-saving and fairly fast (roughly comparable
to a hash table).

Dawid
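the real thing would use Lucene's org.apache.lucene.util.fst package, whose
builders require keys to be added in sorted order; as a self-contained
stand-in showing the same access pattern (a compact, sorted, reusable
term-to-frequency map with no per-entry object overhead), a binary search
over parallel arrays makes the point:

```java
import java.util.Arrays;

public class SortedFreqs {
    // Terms sorted lexicographically, frequencies in a parallel array.
    // Like an FST, lookup needs no per-entry HashMap.Entry objects.
    final String[] terms;
    final long[] freqs;

    SortedFreqs(String[] sortedTerms, long[] freqs) {
        this.terms = sortedTerms;
        this.freqs = freqs;
    }

    long freq(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? freqs[i] : 0;
    }

    public static void main(String[] args) {
        SortedFreqs f = new SortedFreqs(
            new String[]{"apache", "lucene", "term"},  // must be sorted
            new long[]{10, 25, 7});
        System.out.println(f.freq("lucene"));  // 25
    }
}
```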



Re: codec: accessing term dictionary

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yes, this is a reasonable way to use Lucene (to see term statistics across
the corpus), but it may not be performant enough for your needs.

E.g. spending the memory on a giant hash table for one-time or periodic
corpus analysis may be faster.

If you are looking for word n-gram stats, you could index your text with
ShingleFilter to make it faster to get n-gram counts.

Mike McCandless

http://blog.mikemccandless.com
