Posted to dev@lucene.apache.org by Anson Lau <al...@fulfil-net.com> on 2004/09/16 16:27:48 UTC

mg4j - Managing Gigabyte for Java

Hi All,

Has anyone seen the project MG4J (Managing Gigabytes for Java),
http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both Lucene
and MG4J to comment on how the two compare?

Thanks,

Anson

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


RE: mg4j - Managing Gigabyte for Java

Posted by Robert Engels <re...@ix.netcom.com>.
I think the best way to move in this direction is to make IndexReader and
IndexWriter pure interfaces.

It will go a long way toward these sorts of changes, since the API at the
interface level will need configuration (capability query) methods in
order to support using any 'lucene tools' with any 'lucene index'.

I know it has been discussed before, but are interfaces for
IndexReader/IndexWriter going to make it onto the list for 1.9/2.0?
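
To make the "capability queries" idea concrete, here is a minimal sketch of
what such an interface might look like. Every name in it (Capability,
SketchIndexReader, BooleanOnlyReader, PhraseSearchTool) is hypothetical and
invented for illustration; none of it is Lucene's API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical capabilities an index might report; not part of Lucene. */
enum Capability { TERM_FREQUENCIES, POSITIONS, TERM_WEIGHTS }

/** Hypothetical reader interface that exposes a capability query. */
interface SketchIndexReader {
    boolean supports(Capability capability);
    int numDocs();
}

/** An index that stores no frequencies, positions, or weights. */
final class BooleanOnlyReader implements SketchIndexReader {
    private final Set<Capability> capabilities = EnumSet.noneOf(Capability.class);
    private final int numDocs;

    BooleanOnlyReader(int numDocs) {
        this.numDocs = numDocs;
    }

    @Override
    public boolean supports(Capability capability) {
        return capabilities.contains(capability);
    }

    @Override
    public int numDocs() {
        return numDocs;
    }
}

/** A tool that asks what the index offers before relying on it. */
final class PhraseSearchTool {
    static boolean canRunOn(SketchIndexReader reader) {
        // Phrase matching needs term positions.
        return reader.supports(Capability.POSITIONS);
    }
}
```

With this shape, a tool can refuse to run (or fall back) when handed an
index that lacks positions, instead of failing partway through.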



-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Thursday, September 16, 2004 1:56 PM
To: Lucene Developers List
Subject: Re: mg4j - Managing Gigabyte for Java


Antonio Gulli wrote:
> Just a question: my personal experience with a commercial engine I
> partly developed is that the "continuation bit" (aka the AltaVista
> solution) is a good and efficient solution w.r.t. gamma code, delta code
> and other codes used for variable-length int representation (see MG).
>
> Given an int, say n, the continuation bit approach treats each byte as
> 7 bits + 1 bit used to say whether the next byte is also used to
> represent n.

This is what Lucene uses for the reasons you mention: it is a good
compromise between compression and performance.

Long-term I'd like to make Lucene's posting format extensible.  In
addition to altering the compression method, the granularity of the
index should be flexible.  Currently postings for all indexed fields
consist of <document, frequency, <position>*> tuples.  Instead, folks
should be able to have postings like:
   . <document> for pure boolean matching only
   . <document, weight> for vector matching, no phrases
   . <document, frequency, <position, weight>*> for boosting term
     occurrences by, e.g., position in document, bolding, headings, etc.

Extending Lucene to efficiently and flexibly support this will be a
design challenge, but I think it will benefit lots of applications.

Doug



Re: mg4j - Managing Gigabyte for Java

Posted by Doug Cutting <cu...@apache.org>.
Antonio Gulli wrote:
> Just a question: my personal experience with a commercial engine I
> partly developed is that the "continuation bit" (aka the AltaVista
> solution) is a good and efficient solution w.r.t. gamma code, delta code
> and other codes used for variable-length int representation (see MG).
>
> Given an int, say n, the continuation bit approach treats each byte as
> 7 bits + 1 bit used to say whether the next byte is also used to
> represent n.

This is what Lucene uses for the reasons you mention: it is a good 
compromise between compression and performance.

Long-term I'd like to make Lucene's posting format extensible.  In 
addition to altering the compression method, the granularity of the 
index should be flexible.  Currently postings for all indexed fields
consist of <document, frequency, <position>*> tuples.  Instead, folks
should be able to have postings like:
   . <document> for pure boolean matching only
   . <document, weight> for vector matching, no phrases
   . <document, frequency, <position, weight>*> for boosting term
     occurrences by, e.g., position in document, bolding, headings, etc.
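
The three posting shapes listed above could be modelled roughly as follows.
This is only a sketch: the type names (Posting, BooleanPosting,
WeightedPosting, PositionalPosting) and the use of parallel arrays are
assumptions made for illustration, not a description of how Lucene stores
or exposes postings.

```java
/** Hypothetical common view of a posting: every shape names a document. */
interface Posting {
    int document();
}

/** <document>: enough for pure boolean matching. */
final class BooleanPosting implements Posting {
    private final int document;

    BooleanPosting(int document) {
        this.document = document;
    }

    @Override
    public int document() {
        return document;
    }
}

/** <document, weight>: vector matching; no positions, so no phrases. */
final class WeightedPosting implements Posting {
    private final int document;
    private final float weight;

    WeightedPosting(int document, float weight) {
        this.document = document;
        this.weight = weight;
    }

    @Override
    public int document() {
        return document;
    }

    float weight() {
        return weight;
    }
}

/** <document, frequency, <position, weight>*>: per-occurrence boosts. */
final class PositionalPosting implements Posting {
    private final int document;
    private final int[] positions;  // one entry per occurrence
    private final float[] weights;  // parallel to positions

    PositionalPosting(int document, int[] positions, float[] weights) {
        if (positions.length != weights.length) {
            throw new IllegalArgumentException("one weight per position");
        }
        this.document = document;
        this.positions = positions.clone();
        this.weights = weights.clone();
    }

    @Override
    public int document() {
        return document;
    }

    /** Frequency is implied by the number of recorded occurrences. */
    int frequency() {
        return positions.length;
    }

    float weightAt(int occurrence) {
        return weights[occurrence];
    }
}
```

The point of the sketch is that each shape carries only what its matching
mode needs, which is where the space savings would come from.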

Extending Lucene to efficiently and flexibly support this will be a 
design challenge, but I think it will benefit lots of applications.

Doug



Re: mg4j - Managing Gigabyte for Java

Posted by Antonio Gulli <gu...@di.unipi.it>.
David Spencer wrote:

> Anson Lau wrote:
>
>> Hi All,
>>
>> Has anyone seen the project MG4J (Managing Gigabytes for Java),
>> http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both Lucene
>> and MG4J to comment on how the two compare?
>
>
> I've wondered if Lucene does comparable (key/index) compression to 
> what the related book (Managing Gigabytes, excellent BTW) describes...

Just a question: my personal experience with a commercial engine I
partly developed is that the "continuation bit" (aka the AltaVista
solution) is a good and efficient solution w.r.t. gamma code, delta code
and other codes used for variable-length int representation (see MG).

Given an int, say n, the continuation bit approach treats each byte as
7 bits + 1 bit used to say whether the next byte is also used to
represent n.
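
A minimal sketch of the continuation-bit scheme just described, for
non-negative ints. The message does not say which bit carries the flag or
in what order the 7-bit groups are written; using the high bit as the flag
and writing the low-order group first are assumptions made here. The class
name VByteCodec is hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of continuation-bit (variable-byte) coding for non-negative ints. */
final class VByteCodec {

    /** Appends n to out, 7 payload bits per byte, low-order group first. */
    static void write(int n, ByteArrayOutputStream out) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80); // flag set: another byte follows
            n >>>= 7;
        }
        out.write(n); // flag clear: this is the last byte of the value
    }

    /** Decodes one value starting at pos[0] and advances pos[0] past it. */
    static int read(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift; // place this group above the earlier ones
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

Under these assumptions, values up to 127 take one byte, values up to
16383 take two, and any non-negative int fits in at most five.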

On average you will lose some bits on small gaps between contiguous
integers in the posting list, but not that many, since on large
collections gaps are large. In exchange you can operate on machine-oriented
word lengths instead of bit operations, which are much more expensive.
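
The size trade-off can be checked with a little arithmetic. The sketch
below compares the bit cost of the Elias gamma code, 2*floor(log2 n) + 1
bits for n >= 1, with a continuation-bit code that spends 8 bits per 7
payload bits. The class name CodeLengths is hypothetical, and the 7+1 byte
layout is the one described in this message.

```java
/** Sketch: bit cost of a gap under gamma coding vs. continuation-bit coding. */
final class CodeLengths {

    /** Bits used by the Elias gamma code for n >= 1: 2*floor(log2 n) + 1. */
    static int gammaBits(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("gamma code is defined for n >= 1");
        }
        int floorLog2 = 31 - Integer.numberOfLeadingZeros(n);
        return 2 * floorLog2 + 1;
    }

    /** Bits used by a 7-payload-bits-per-byte continuation code for n >= 0. */
    static int continuationBits(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Zero still needs one byte, so count at least one significant bit.
        int significant = Math.max(1, 32 - Integer.numberOfLeadingZeros(n));
        int bytes = (significant + 6) / 7; // ceiling of significant / 7
        return 8 * bytes;
    }
}
```

For a gap of 3 this gives 3 bits for gamma against 8 for the byte code,
which is the loss on small gaps; for a gap of 1000 it gives 19 bits against
16, so the byte-aligned code is no longer the larger one.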

I saw a small increase in index size, but a big saving in query
time. Any similar / opposite experience?

-- 
"We have no credible evidence that Iraq and Al Qaeda 
cooperated on attacks against the United States."
Staff report of the commission investigating the Sept. 
11 attacks.




Re: mg4j - Managing Gigabyte for Java

Posted by David Spencer <da...@tropo.com>.
Anson Lau wrote:

> Hi All,
> 
> Has anyone seen the project MG4J (Managing Gigabytes for Java),
> http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both Lucene
> and MG4J to comment on how the two compare?

I've wondered if Lucene does comparable (key/index) compression to what 
the related book (Managing Gigabytes, excellent BTW) describes...


> 
> Thanks,
> 
> Anson
> 




Re: mg4j - Managing Gigabyte for Java

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Anson,

It's not quite correct to compare MG4J and Lucene directly.  Lucene
is a toolkit whose primary goal is to let you create an index and
search it, while MG4J is really a library of Java classes that people
implementing an IR library (such as Lucene, for example) may find
useful.  You cannot create a searchable index with MG4J alone.

Otis


--- Anson Lau <al...@fulfil-net.com> wrote:

> Hi All,
> 
> Has anyone seen the project MG4J (Managing Gigabytes for Java),
> http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both Lucene
> and MG4J to comment on how the two compare?
> 
> Thanks,
> 
> Anson
> 

