Posted to java-user@lucene.apache.org by Rafael Turk <ra...@gmail.com> on 2008/04/23 13:25:49 UTC

Lucene and Google Web 1T 5 Gram

Hi Folks,

   I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This
corpus contains English word n-grams and their observed frequency counts. The
length of the n-grams ranges from unigrams (single words) to five-grams.)

   I'm loading each n-gram (each row is one n-gram) as an individual Document.
This way I'll be able to search for each n-gram separately, but I'm ending up
with huge indexes, which makes them very hard to build and read.

  Is there a better way to load and read n-grams in a Lucene index? Maybe
using a lower-level API?


More info about the Google Web 1T 5-gram corpus at:
<http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
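For context, each row of the corpus pairs an n-gram with its frequency count,
tab-separated, per the LDC readme. A minimal sketch of parsing one row in
plain Java (class and field names here are illustrative, not from any
particular library):

```java
// Minimal sketch: parse one corpus row of the form
// "word1 word2 ... wordN<TAB>count" into its n-gram text and frequency.
public class NgramRow {
    final String ngram;
    final long count;

    NgramRow(String ngram, long count) {
        this.ngram = ngram;
        this.count = count;
    }

    static NgramRow parse(String line) {
        int tab = line.lastIndexOf('\t');
        return new NgramRow(line.substring(0, tab),
                Long.parseLong(line.substring(tab + 1).trim()));
    }

    public static void main(String[] args) {
        NgramRow r = NgramRow.parse("serve as the incoming\t92");
        System.out.println(r.ngram + " -> " + r.count);
        // serve as the incoming -> 92
    }
}
```

Whether each such record then becomes a Lucene Document or a key-value pair
is an indexing decision, separate from parsing.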

Thanks,

Rafael

Re: Lucene and Google Web 1T 5 Gram

Posted by Rafael Turk <ra...@gmail.com>.
Thanks Julien,

 I´ll definitely give it a try!!!

[]s

Rafael

On Wed, Apr 23, 2008 at 8:38 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Rafael,
>
> We initially tried to do the same but ended up developing our own API for
> querying the Web 1T. You can find more details at
> http://digitalpebble.com/resources.html
> There might be a way to reuse elements from Lucene, e.g. only the term
> index, but I could not find an obvious way to achieve that.
>
> Best,
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> On 23/04/2008, Rafael Turk <ra...@gmail.com> wrote:
> >
> > Hi Folks,
> >
> >    I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This
> > corpus contains English word n-grams and their observed frequency counts.
> > The length of the n-grams ranges from unigrams (single words) to
> > five-grams.)
> >
> >    I'm loading each n-gram (each row is one n-gram) as an individual
> > Document. This way I'll be able to search for each n-gram separately, but
> > I'm ending up with huge indexes, which makes them very hard to build and
> > read.
> >
> >   Is there a better way to load and read n-grams in a Lucene index? Maybe
> > using a lower-level API?
> >
> >
> > More info about the Google Web 1T 5-gram corpus at:
> > <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
> >
> > Thanks,
> >
> >
> > Rafael
> >
>

Re: Lucene and Google Web 1T 5 Gram

Posted by Julien Nioche <li...@gmail.com>.
Hi Rafael,

We initially tried to do the same but ended up developing our own API for
querying the Web 1T. You can find more details at
http://digitalpebble.com/resources.html
There might be a way to reuse elements from Lucene, e.g. only the term index,
but I could not find an obvious way to achieve that.

Best,

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


On 23/04/2008, Rafael Turk <ra...@gmail.com> wrote:
>
> Hi Folks,
>
>    I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This
> corpus contains English word n-grams and their observed frequency counts.
> The length of the n-grams ranges from unigrams (single words) to
> five-grams.)
>
>    I'm loading each n-gram (each row is one n-gram) as an individual
> Document. This way I'll be able to search for each n-gram separately, but
> I'm ending up with huge indexes, which makes them very hard to build and
> read.
>
>   Is there a better way to load and read n-grams in a Lucene index? Maybe
> using a lower-level API?
>
>
> More info about the Google Web 1T 5-gram corpus at:
> <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
>
> Thanks,
>
>
> Rafael
>

Re: Lucene and Google Web 1T 5 Gram

Posted by Mathieu Lecarme <ma...@garambrogne.net>.
Rafael Turk wrote:
> Hi Mathieu,
>
> *What do you want to do?*
>
> A spell checker and related keyword suggestions
>
Here is a spell checker which I am trying to finalize:
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java

> If you want an n-gram => popularity map, just use a Berkeley DB, and use
> this information in your Lucene application. Lucene is an inverted index,
> Berkeley DB a plain key-value index.
>
> *Great idea! Berkeley DB is definitely worth a try, simple and effective,
> but I'll have to preprocess the data first. I was hoping to take advantage
> of Lucene's built-in features.*
>
Lucene provides nice tools without the need to index. Analyzer and
TokenFilter can help you, I guess.

M.
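The token-stream idea above, deriving n-grams from text rather than from an
index, can be sketched in plain Java. This stands in for what a
shingle-producing TokenFilter would do; the class and method names are
illustrative and carry no Lucene dependency:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a shingle-producing TokenFilter: emit all word n-grams
// (1..maxN words) over a token stream, without touching any index.
public class Shingles {
    static List<String> shingles(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("new", "york", "city"), 2));
        // [new, york, city, new york, york city]
    }
}
```

Each emitted shingle could then be looked up in the n-gram/count store to
rank spelling or keyword suggestions.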

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene and Google Web 1T 5 Gram

Posted by Karl Wettin <ka...@gmail.com>.
Rafael Turk wrote:
>
> *Great idea! Berkeley DB is definitely worth a try, simple and effective,
> but I'll have to preprocess the data first.*

JDBM has a more appealing license if you ask the ASF.


            karl



Re: Lucene and Google Web 1T 5 Gram

Posted by Rafael Turk <ra...@gmail.com>.
Hi Mathieu,

*What do you want to do?*

A spell checker and related keyword suggestions

If you want an n-gram => popularity map, just use a Berkeley DB, and use this
information in your Lucene application. Lucene is an inverted index, Berkeley
DB a plain key-value index.

*Great idea! Berkeley DB is definitely worth a try, simple and effective, but
I'll have to preprocess the data first. I was hoping to take advantage of
Lucene's built-in features.*

*[]s*
On Wed, Apr 23, 2008 at 10:16 AM, Mathieu Lecarme <ma...@garambrogne.net>
wrote:

> Rafael Turk wrote:
>
> > Hi Folks,
> >
> >   I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This
> > corpus contains English word n-grams and their observed frequency counts.
> > The length of the n-grams ranges from unigrams (single words) to
> > five-grams.)
> >
> >   I'm loading each n-gram (each row is one n-gram) as an individual
> > Document. This way I'll be able to search for each n-gram separately, but
> > I'm ending up with huge indexes, which makes them very hard to build and
> > read.
> >
> >  Is there a better way to load and read n-grams in a Lucene index? Maybe
> > using a lower-level API?
> >
> >
> > More info about the Google Web 1T 5-gram corpus at:
> > <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
> >
> > Thanks,
> >
> > Rafael
> >
> >
>
> What do you want to do?
> If you want an n-gram => popularity map, just use a Berkeley DB, and use
> this information in your Lucene application. Lucene is an inverted index,
> Berkeley DB a plain key-value index.
>
> M.
>
>

Re: Lucene and Google Web 1T 5 Gram

Posted by Mathieu Lecarme <ma...@garambrogne.net>.
Rafael Turk wrote:
> Hi Folks,
>
>    I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This
> corpus contains English word n-grams and their observed frequency counts.
> The length of the n-grams ranges from unigrams (single words) to
> five-grams.)
>
>    I'm loading each n-gram (each row is one n-gram) as an individual
> Document. This way I'll be able to search for each n-gram separately, but
> I'm ending up with huge indexes, which makes them very hard to build and
> read.
>
>   Is there a better way to load and read n-grams in a Lucene index? Maybe
> using a lower-level API?
>
>
> More info about the Google Web 1T 5-gram corpus at:
> <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
>
> Thanks,
>
> Rafael
>

What do you want to do?
If you want an n-gram => popularity map, just use a Berkeley DB, and use
this information in your Lucene application. Lucene is an inverted index,
Berkeley DB a plain key-value index.

M.
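The n-gram => popularity map suggested above can be sketched with an
in-memory sorted map standing in for Berkeley DB; a B-tree-backed store would
support the same exact-match and prefix-range lookups shown here. The
`NgramCounts` class and its methods are hypothetical names for illustration:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the "n-gram => popularity" map idea, with an in-memory sorted
// map standing in for a persistent B-tree store such as Berkeley DB.
public class NgramCounts {
    private final TreeMap<String, Long> counts = new TreeMap<>();

    void add(String ngram, long count) {
        counts.merge(ngram, count, Long::sum);
    }

    long count(String ngram) {
        return counts.getOrDefault(ngram, 0L);
    }

    // All stored n-grams starting with the given prefix,
    // e.g. for keyword suggestion.
    SortedMap<String, Long> withPrefix(String prefix) {
        return counts.subMap(prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        NgramCounts counts = new NgramCounts();
        counts.add("serve as the", 200L);
        counts.add("serve as the incoming", 92L);
        System.out.println(counts.count("serve as the incoming")); // 92
        System.out.println(counts.withPrefix("serve as").size());  // 2
    }
}
```

The division of labor then matches Mathieu's point: the key-value store
answers "how popular is this n-gram?", while Lucene's inverted index answers
"which documents contain this term?".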

