Posted to dev@lucene.apache.org by "Steven Parkes (JIRA)" <ji...@apache.org> on 2007/03/28 19:09:25 UTC

[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

      Description: Add support for using Wikipedia for benchmarking.  (was: Add support for using Wikipedia for benchmarking. If no one is working on this, I'll start soon.)
    Lucene Fields:   (was: [New])
          Summary: Add supported for Wikipedia English as a corpus in the benchmarker stuff  (was: Add supported for Wikipediea English as a corpus in the benchmarker stuff)

Can't leave the typo in the title. It's bugging me.

Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download the bzip2 snapshots they provide (and that they prefer you use, if you're getting much).

Question (for Doron and anyone else): the file is XML and it's big, so DOM isn't going to work. I could still use something SAX-based, but since the format is so tightly controlled, I'm thinking regular expressions would be sufficient and have fewer dependencies. Anyone have opinions on this?
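For comparison, the SAX route is only a handful of lines and keeps memory flat no matter how big the dump is. This is a hypothetical sketch, not the eventual benchmark code; the element names just follow the MediaWiki dump's <page>/<title> layout:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Streams a MediaWiki-style dump with SAX, collecting one title per <page>. */
public class WikiTitleExtractor {

    public static List<String> extractTitles(String xml) throws Exception {
        final List<String> titles = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private boolean inTitle = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    inTitle = true;
                    buf.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    inTitle = false;
                    titles.add(buf.toString());
                }
            }
        };
        // Constant memory: the parser pushes events as it reads; no DOM tree is built.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<mediawiki>"
          + "<page><title>Anarchism</title><text>...</text></page>"
          + "<page><title>AccessibleComputing</title><text>...</text></page>"
          + "</mediawiki>";
        System.out.println(extractTitles(sample)); // [Anarchism, AccessibleComputing]
    }
}
```

A regex scanner would avoid the parser dependency, but as discussed below, it would also have to handle escape sequences itself.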

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Apr 2, 2007, at 2:50 PM, Steven Parkes wrote:

> On the one hand, creating separate per-article files is "clean" in that
> when you then ingest, you only have disk i/o that's going to affect the
> ingest performance (as opposed to, say, uncompressing/parsing). On the
> other hand, that's a lot of disk i/o (compresses by about 5X) and a lot
> of directory lookups.

One reason I was expanding the elements into individual files was so  
that I could compare different libraries against Lucene, including  
those in other languages.  It was important to measure the engines  
themselves, not SGML parsers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Posted by Steven Parkes <st...@esseff.org>.
> Yes, indeed.  May not be necessary initially, but we could support
> XPath or something down the road to allow us to specify what things
> we are interested in.  I wouldn't worry about generalizing too much
> to start with.  Once we have a couple collections then we can go that
> route.

My thoughts, too.

I've been looking at the Reuters stuff. It uncompresses the distribution
and then creates per-article files. I can't decide whether I think that's
a good idea for Wikipedia. It's big (about 10G uncompressed) and has about
1.2M files (so I've heard; unverified).

On the one hand, creating separate per-article files is "clean" in that
when you then ingest, you only have disk i/o that's going to affect the
ingest performance (as opposed to, say, uncompressing/parsing). On the
other hand, that's a lot of disk i/o (compresses by about 5X) and a lot
of directory lookups.

Anybody have any opinions/relevant past experience?
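One middle ground is to skip the per-article files entirely and feed the indexer straight from the compressed stream. A rough sketch of that shape, using GZIP from the JDK as a stand-in because the JDK ships no bzip2 codec - a real harvester would wrap the dump in something like Commons Compress's BZip2CompressorInputStream instead:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * Ingests articles straight from one compressed stream, never touching
 * per-article files, so the only disk i/o is the sequential read of the
 * archive itself.
 */
public class StreamIngest {

    /** One hypothetical "article" per line, for the sake of the sketch. */
    public static List<String> ingest(byte[] compressed) throws IOException {
        List<String> docs = new ArrayList<String>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                docs.add(line); // hand straight to the indexer; no directory lookups
            }
        }
        return docs;
    }

    /** Builds a tiny in-memory "dump" so the sketch is self-contained. */
    public static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] dump = compress("article one\narticle two\n");
        System.out.println(ingest(dump).size()); // 2
    }
}
```

The decompression cost then shows up inside the measured ingest, which is exactly the trade-off being debated here.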



Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Posted by Grant Ingersoll <gr...@gmail.com>.
On Apr 2, 2007, at 3:41 PM, Steven Parkes wrote:

> I checked and there are escape sequences in there. If it was ever
> debatable, I think that tips it in favor of SAX. xerces? The
> contrib/gdata stuff seems to use it.

Xerces should be fine, I think.

>
> I suppose if I'm careful and creative enough, we could share a lot of
> the code amongst benchmark ingesters that use XML, should there be more
> ...
>

Yes, indeed.  May not be necessary initially, but we could support  
XPath or something down the road to allow us to specify what things  
we are interested in.  I wouldn't worry about generalizing too much  
to start with.  Once we have a couple collections then we can go that  
route.


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Posted by Steven Parkes <st...@esseff.org>.
I checked and there are escape sequences in there. If it was ever
debatable, I think that tips it in favor of SAX. Xerces? The
contrib/gdata stuff seems to use it.

I suppose if I'm careful and creative enough, we could share a lot of
the code amongst benchmark ingesters that use XML, should there be more
... 
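If it helps the discussion, the shared-code idea could look something like a base handler that owns the SAX plumbing while each corpus just names its elements of interest. Purely a hypothetical shape - the class and method names below are invented, not the actual contrib/benchmark API:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical base class for XML-based benchmark ingesters: it owns the
 * SAX plumbing, and each corpus (Wikipedia, Reuters, ...) only declares
 * which elements become document fields.
 */
public abstract class XmlDocMaker extends DefaultHandler {
    private final Map<String, StringBuilder> fields = new HashMap<String, StringBuilder>();
    private StringBuilder current;

    /** Each corpus lists its elements of interest. */
    protected abstract Set<String> fieldElements();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (fieldElements().contains(qName)) {
            current = new StringBuilder();
            fields.put(qName, current);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (fieldElements().contains(qName)) current = null;
    }

    public Map<String, StringBuilder> parse(String xml) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), this);
        return fields;
    }

    public static void main(String[] args) throws Exception {
        XmlDocMaker wiki = new XmlDocMaker() {
            @Override protected Set<String> fieldElements() {
                return new HashSet<String>(Arrays.asList("title", "text"));
            }
        };
        Map<String, StringBuilder> f = wiki.parse("<page><title>A</title><text>B</text></page>");
        System.out.println(f.get("title") + "/" + f.get("text")); // A/B
    }
}
```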

-----Original Message-----
From: Grant Ingersoll [mailto:grant.ingersoll@gmail.com] 
Sent: Wednesday, March 28, 2007 10:44 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia
English as a corpus in the benchmarker stuff



Personally, I think SAX is the way to go, as you'll get handling of  
escape sequences, etc. out of the box.  And seems like it is easier  
to read/maintain????





Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Posted by Doron Cohen <DO...@il.ibm.com>.
Grant Ingersoll <gr...@gmail.com> wrote on 28/03/2007 10:44:08:

>
> On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
>
> > Question (for Doron and anyone else): the file is xml and it's big,
> > so DOM isn't going to work. I could still use something SAX based
> > but since the format is so tightly controlled, I'm thinking regular
> > expressions would be sufficient and have less dependences. Anyone
> > have opinions on this?
>
>
> Personally, I think SAX is the way to go, as you'll get handling of
> escape sequences, etc. out of the box.  And seems like it is easier
> to read/maintain????

TrecDocMaker relies on the strict structure of the input data - the
read() method there "eats" the input stream until reaching points of
interest and optionally collects (lines of) text. Depending on the format,
you may be able to use a variation of this here. If the input is not that
strictly defined, SAX would be better.
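That read() pattern boils down to a line scanner: skip until an opening marker, then collect until the closing one. A rough illustration - the marker names are made up for the example, and this is not the real TrecDocMaker code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/**
 * Sketch of the "eat until the point of interest" approach: scan line by
 * line, skip until an opening marker, then collect lines of text until
 * the closing marker.
 */
public class MarkerScanner {

    /** Returns the text between the first open/close pair, or null if absent. */
    public static String readBetween(BufferedReader in, String open, String close)
            throws IOException {
        String line;
        // "Eat" the stream until the opening marker.
        while ((line = in.readLine()) != null && !line.contains(open)) { }
        if (line == null) return null;
        StringBuilder body = new StringBuilder();
        // Collect lines of text until the closing marker.
        while ((line = in.readLine()) != null && !line.contains(close)) {
            body.append(line).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        String data = "junk\n<TEXT>\nhello\nworld\n</TEXT>\nmore junk\n";
        BufferedReader in = new BufferedReader(new StringReader(data));
        System.out.print(readBetween(in, "<TEXT>", "</TEXT>")); // hello, then world
    }
}
```

This works only as long as the markers really do land on their own lines and never appear escaped inside the text - exactly the "strict structure" assumption Doron is describing.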




Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Posted by Grant Ingersoll <gr...@gmail.com>.
On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:

> Question (for Doron and anyone else): the file is xml and it's big,  
> so DOM isn't going to work. I could still use something SAX based  
> but since the format is so tightly controlled, I'm thinking regular  
> expressions would be sufficient and have less dependences. Anyone  
> have opinions on this?


Personally, I think SAX is the way to go, as you'll get handling of  
escape sequences, etc. out of the box.  And seems like it is easier  
to read/maintain????
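The out-of-the-box behavior meant here is easy to see: by the time characters() fires, the parser has already resolved entities like &amp;, so the handler never deals with raw escapes. A minimal demonstration (hypothetical class name):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Shows that SAX resolves escape sequences before the handler sees the
 * text: the callback receives plain characters, not entity references.
 */
public class EscapeDemo {

    public static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length); // already unescaped here
                }
            });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf("<doc>AT&amp;T &lt;rocks&gt;</doc>")); // AT&T <rocks>
    }
}
```

A regex-based reader would have to reimplement exactly this decoding step itself.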


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


