Posted to java-user@lucene.apache.org by Rafael Turk <ra...@gmail.com> on 2008/04/23 13:25:49 UTC
Lucene and Google Web 1T 5 Gram
Hi Folks,
I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This corpus
contains English word n-grams and their observed frequency counts. The length
of the n-grams ranges from unigrams (single words) to five-grams.)
I'm loading each n-gram (each row is one n-gram) as an individual Document.
This way I'll be able to search for each n-gram separately, but I'm ending up
with huge indexes, which makes them very hard to load and read.
Is there a better way to load and read n-grams in a Lucene index? Maybe
using a lower-level API?
More Info about Google Web 1T 5 Gram corpus at:
<http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
Thanks,
Rafael
Re: Lucene and Google Web 1T 5 Gram
Posted by Rafael Turk <ra...@gmail.com>.
Thanks Julien,
I'll definitely give it a try!
[]s
Rafael
On Wed, Apr 23, 2008 at 8:38 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> Hi Rafael,
>
> We initially tried to do the same but ended up developing our own API for
> querying the Web 1T corpus. You can find more details at
> http://digitalpebble.com/resources.html
> There may be a way to reuse elements of Lucene, e.g. only the term index,
> but I could not find an obvious way to achieve that.
>
> Best,
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
Re: Lucene and Google Web 1T 5 Gram
Posted by Julien Nioche <li...@gmail.com>.
Hi Rafael,
We initially tried to do the same but ended up developing our own API for
querying the Web 1T corpus. You can find more details at
http://digitalpebble.com/resources.html
There may be a way to reuse elements of Lucene, e.g. only the term index,
but I could not find an obvious way to achieve that.
Best,
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: Lucene and Google Web 1T 5 Gram
Posted by Mathieu Lecarme <ma...@garambrogne.net>.
Rafael Turk wrote:
> Hi Mathieu,
>
> *What do you want to do?*
>
> A spell checker and related keyword suggestions.
>
>
Here is a spell checker which I am trying to finalize:
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java
> If you want an ngram => popularity map, just use a Berkeley DB, and use this
> information in your Lucene application. Lucene is an inverted index;
> Berkeley DB is a plain key-value index.
>
> *Great idea! Berkeley DB is definitely worth a try, simple and effective, but
> I'll have to preprocess the data first. I was hoping to take advantage of
> Lucene's built-in features.*
>
Lucene provides nice tools that work without building an index. Analyzer and
TokenFilter can help you, I guess.
M.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene and Google Web 1T 5 Gram
Posted by Karl Wettin <ka...@gmail.com>.
Rafael Turk wrote:
>
> *Great idea! Berkeley DB is definitely worth a try, simple and effective, but
> I'll have to preprocess the data first.
JDBM has a more appealing license, if you ask the ASF.
karl
Re: Lucene and Google Web 1T 5 Gram
Posted by Rafael Turk <ra...@gmail.com>.
Hi Mathieu,
*What do you want to do?*
A spell checker and related keyword suggestions.
If you want an ngram => popularity map, just use a Berkeley DB, and use this
information in your Lucene application. Lucene is an inverted index;
Berkeley DB is a plain key-value index.
*Great idea! Berkeley DB is definitely worth a try, simple and effective, but
I'll have to preprocess the data first. I was hoping to take advantage of
Lucene's built-in features.*
*[]s*
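For the related keyword suggestion use case mentioned above, one possible approach is a prefix scan over a sorted n-gram => count map, ranking the matches by frequency. The following is only a sketch under assumptions: the class name NgramSuggester, its methods, and the sample counts are all hypothetical, and a sorted in-memory TreeMap stands in for whatever sorted store (for example Berkeley DB or JDBM) would hold the real corpus.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical helper: suggests longer n-grams that extend a given phrase. */
final class NgramSuggester {
    // Sorted by n-gram text so that all n-grams sharing a prefix are adjacent.
    private final NavigableMap<String, Long> counts = new TreeMap<>();

    void add(String ngram, long count) {
        counts.put(ngram, count);
    }

    /** Returns up to 'limit' n-grams starting with 'prefix' plus a space, most frequent first. */
    List<String> suggest(String prefix, int limit) {
        String from = prefix + " ";
        // '\uffff' sorts after any ordinary character, so it closes the prefix range.
        String to = from + '\uffff';
        List<Map.Entry<String, Long>> hits =
                new ArrayList<>(counts.subMap(from, true, to, false).entrySet());
        hits.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> hit : hits) {
            if (result.size() == limit) {
                break;
            }
            result.add(hit.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        NgramSuggester suggester = new NgramSuggester();
        // Made-up counts for illustration, not real corpus data.
        suggester.add("new york", 500);
        suggester.add("new york city", 120);
        suggester.add("new york times", 300);
        suggester.add("new yorker", 50);
        System.out.println(suggester.suggest("new york", 2));
        // prints [new york times, new york city]
    }
}
```

The trailing space in the range start keeps "new yorker" out of the suggestions for "new york", since only n-grams that continue with a further word are wanted here.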
On Wed, Apr 23, 2008 at 10:16 AM, Mathieu Lecarme <ma...@garambrogne.net>
wrote:
> What do you want to do?
> If you want an ngram => popularity map, just use a Berkeley DB, and use
> this information in your Lucene application. Lucene is an inverted index;
> Berkeley DB is a plain key-value index.
>
> M.
Re: Lucene and Google Web 1T 5 Gram
Posted by Mathieu Lecarme <ma...@garambrogne.net>.
What do you want to do?
If you want an ngram => popularity map, just use a Berkeley DB, and use
this information in your Lucene application. Lucene is an inverted index;
Berkeley DB is a plain key-value index.
M.
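The ngram => popularity map suggested above can be sketched in plain Java. This is only an illustration under assumptions: it assumes each corpus row has the form "w1 w2 ... wn<TAB>count", the class name NgramCounts and its parse method are hypothetical, and an in-memory HashMap stands in for the disk-backed store (such as Berkeley DB) that the full corpus would likely need.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: builds an n-gram => count map from corpus rows. */
final class NgramCounts {

    /**
     * Parses rows of the assumed form "w1 w2 ... wn<TAB>count".
     * Rows without a tab or with a non-numeric count are skipped.
     */
    static Map<String, Long> parse(Iterable<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                continue; // no tab separator, or empty n-gram
            }
            try {
                long count = Long.parseLong(line.substring(tab + 1).trim());
                counts.put(line.substring(0, tab), count);
            } catch (NumberFormatException e) {
                // Malformed count: skip this row.
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real corpus data.
        Map<String, Long> counts = parse(List.of(
                "new york\t500",
                "new york city\t120",
                "not a valid row"));
        System.out.println(counts.getOrDefault("new york city", 0L)); // prints 120
        System.out.println(counts.getOrDefault("unseen phrase", 0L)); // prints 0
    }
}
```

A Lucene application could then consult such a map at query time (for example to rank spelling candidates by frequency) without storing every n-gram as its own Document.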