Posted to java-user@lucene.apache.org by Dawn Zoë Raison <da...@digitorial.co.uk> on 2011/11/28 20:09:52 UTC

Analysers for newspaper pages...

Hi folks,

I'm researching the best options for analysing/storing newspaper 
pages in our online archive, and wondered if anyone has any good hints 
or tips on good practice for this type of media?

I'm currently thinking along the lines of using a customised 
StandardAnalyzer (no stop words + extra date token detection) wrapped 
with a ShingleFilter and finally a stop-word filter - the thinking being 
that this should reduce the impact of stop words but still allow "to be 
or not to be" searches...

A future aim is to add a synonym filter at search time.
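
If/when it comes to that, a rough sketch is below - this assumes
Lucene 3.4+, where the FST-based SynonymFilter lives in the analyzers
module, and the synonym pair is an invented example:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.synonym.SynonymFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;

    public final class QuerySynonyms {
        // Wrap a query-time token stream with a (tiny, invented) synonym
        // map; the index itself stays untouched.
        public static TokenStream wrap(TokenStream in) throws IOException {
            SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup
            // includeOrig = true keeps the original term alongside the synonym.
            builder.add(new CharsRef("motorcar"), new CharsRef("car"), true);
            return new SynonymFilter(in, builder.build(), true); // ignoreCase
        }
    }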

We currently have ~2.5 million pages - some of the older broadsheet pages 
can have a serious number of tokens.
We currently index using the SimpleAnalyzer - a hangover from the 
previous developers I hope to remedy :-).

-- 

Rgds.
*Dawn Raison*
Technical Director, Digitorial Ltd.



Re: Analysers for newspaper pages...

Posted by Ian Lea <ia...@gmail.com>.
You can easily use just the CommonGrams stuff from Solr in your pure
Lucene project.
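
For example, with the (same-versioned) solr-core jar on the classpath,
an index-time chain might look something like this - the class name and
the common-word set are placeholders, and the constructors are the
Solr 3.x-era ones:

    import java.io.Reader;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;
    import org.apache.solr.analysis.CommonGramsFilter;

    // Index-time chain: common words are kept both as single terms and
    // as "word_word" bigrams, so phrases containing them stay searchable.
    public class CommonGramsIndexAnalyzer extends Analyzer {
        private final Set<?> commonWords; // e.g. the usual English stop list

        public CommonGramsIndexAnalyzer(Set<?> commonWords) {
            this.commonWords = commonWords;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardTokenizer(Version.LUCENE_35, reader);
            ts = new LowerCaseFilter(Version.LUCENE_35, ts);
            return new CommonGramsFilter(ts, commonWords);
        }
    }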

There are a couple of useful docs on stop words and common grams et al at

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

--
Ian.

On Mon, Nov 28, 2011 at 8:31 PM, Dawn Zoë Raison <da...@digitorial.co.uk> wrote:
> Hi Steve,
>
> On 28/11/2011 19:43, Steven A Rowe wrote:
>>
>> I assume that when you refer to "the impact of stop words," you're
>> concerned about query-time performance?  You should consider the possibility
>> that performance without removing stop words is good enough that you won't
>> have to take any steps to address the issue.
>
> Not too fussed about query-time performance; certainly no-one has complained
> so far. It's more the sheer number of junk pages we get searching on phrases
> that contain stop words - it can lead to hundreds of thousands of results,
> and the pedants among our userbase insist on paging through the lot :-|
>
> I'd much rather contain the stop words using a *gram-based approach and
> offer a less populous but more accurate result set.
>
>>
>> That said, there are two filters in Solr 3.X[1] that would do the
>> equivalent of what you have outlined:
>> CommonGramsFilter<http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html>
>>  and
>> CommonGramsQueryFilter<http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html>.
>
> We use Lucene directly, but I'll take a look - Thanks.
>
>> You can use these filters with a Lucene 3.X application by including the
>> (same-versioned) solr-core jar as a dependency.
>>
>> Steve
>
> --
>
> Rgds.
> *Dawn Raison*
>
>



Re: Analysers for newspaper pages...

Posted by Dawn Zoë Raison <da...@digitorial.co.uk>.
Hi Steve,

On 28/11/2011 19:43, Steven A Rowe wrote:
> I assume that when you refer to "the impact of stop words," you're concerned about query-time performance?  You should consider the possibility that performance without removing stop words is good enough that you won't have to take any steps to address the issue.
Not too fussed about query-time performance; certainly no-one has 
complained so far. It's more the sheer number of junk pages we get 
searching on phrases that contain stop words - it can lead to hundreds 
of thousands of results, and the pedants among our userbase insist on 
paging through the lot :-|

I'd much rather contain the stop words using a *gram-based approach and 
offer a less populous but more accurate result set.

>
> That said, there are two filters in Solr 3.X[1] that would do the equivalent of what you have outlined: CommonGramsFilter<http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html>  and CommonGramsQueryFilter<http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html>.
We use Lucene directly, but I'll take a look - Thanks.

> You can use these filters with a Lucene 3.X application by including the (same-versioned) solr-core jar as a dependency.
>
> Steve

-- 

Rgds.
*Dawn Raison*


RE: Analysers for newspaper pages...

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Dawn,

I assume that when you refer to "the impact of stop words," you're concerned about query-time performance?  You should consider the possibility that performance without removing stop words is good enough that you won't have to take any steps to address the issue.

That said, there are two filters in Solr 3.X[1] that would do the equivalent of what you have outlined: CommonGramsFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html> and CommonGramsQueryFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html>.

You can use these filters with a Lucene 3.X application by including the (same-versioned) solr-core jar as a dependency.
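
The query-time counterpart pairs the two: CommonGramsQueryFilter wraps a
CommonGramsFilter and keeps just the bigrams (plus any word not adjacent
to a common word), so a bare stop word never reaches the index lookup.
A sketch under the same assumptions as the index-time example above:

    import java.io.Reader;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;
    import org.apache.solr.analysis.CommonGramsFilter;
    import org.apache.solr.analysis.CommonGramsQueryFilter;

    // Query-time chain: where the index-time analyzer emits unigrams and
    // common-word bigrams, this one collapses to the bigrams, so a query
    // like "to be or not to be" matches only the indexed word pairs.
    public class CommonGramsQueryAnalyzer extends Analyzer {
        private final Set<?> commonWords; // must match the index-time set

        public CommonGramsQueryAnalyzer(Set<?> commonWords) {
            this.commonWords = commonWords;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardTokenizer(Version.LUCENE_35, reader);
            ts = new LowerCaseFilter(Version.LUCENE_35, ts);
            return new CommonGramsQueryFilter(
                new CommonGramsFilter(ts, commonWords));
        }
    }

You would then hand the index-time analyzer to your IndexWriter and this
one to the QueryParser, keeping the two word sets identical.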

Steve

[1] In Lucene/Solr trunk, which will be released as 4.0, these filters have been moved to a shared Lucene/Solr module.

> -----Original Message-----
> From: Dawn Zoë Raison [mailto:dawn@digitorial.co.uk]
> Sent: Monday, November 28, 2011 2:10 PM
> To: java-user@lucene.apache.org
> Subject: Analysers for newspaper pages...
> 
> Hi folks,
> 
> I'm researching the best options for analysing/storing newspaper
> pages in our online archive, and wondered if anyone has any good hints
> or tips on good practice for this type of media?
> 
> I'm currently thinking along the lines of using a customised
> StandardAnalyzer (no stop words + extra date token detection) wrapped
> with a ShingleFilter and finally a stop-word filter - the thinking being
> that this should reduce the impact of stop words but still allow "to be
> or not to be" searches...
> 
> A future aim is to add a synonym filter at search time.
> 
> We currently have ~2.5 million pages - some of the older broadsheet pages
> can have a serious number of tokens.
> We currently index using the SimpleAnalyzer - a hangover from the
> previous developers I hope to remedy :-).
> 
> --
> 
> Rgds.
> *Dawn Raison*
> Technical Director, Digitorial Ltd.
>