Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/05/10 14:45:07 UTC

[GitHub] [lucene] mikemccand commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

mikemccand commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836788286


   > > I also tried to run wikibigall as well, which seems to require enwiki-20100302-pages-articles-lines.txt but it's not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving 404 now.
   > 
   > Hmm good question, I had downloaded this file years ago, I'm not sure where to find it nowadays. @mikemccand Do you know where to find it? Otherwise I'll upload mine somewhere.
   
   Egads, it is indeed missing!  I will re-upload it.  Hmm, I do not seem to have that exact file cached locally. Confusing ;)  I have the `-1kb` version (medium-sized docs), but not the `big` docs.  @jpountz could you please post yours somewhere, maybe your `home.apache.org/~jpountz`, using `sftp`?  Then I'll download it and copy it up to `/~mikemccand`.
   
   The nightly benchmarks use the binary form of `wikibigall`, to reduce the thread bottleneck when reading/parsing documents to index.  Hmm, it is sampled from a different date (01/15/2011) ... OK, I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
   
   BTW there is a [`luceneutil` issue to re-sample the Wikipedia export](https://github.com/mikemccand/luceneutil/issues/91), but, alas, it is stuck because the latest `enwiki` download, after converting XML -> text, is SMALLER than the sample we pulled 11 years ago!  I have not yet been able to explain that ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


