Posted to dev@clerezza.apache.org by Stephane Gamard <st...@gamard.net> on 2013/10/02 16:04:27 UTC

Search in rdf.cris

Hi Team, 

My name's Stephane and I am currently participating in the Fusepool FP7 project. Within this project we are using Stanbol & Clerezza as key architectural components. Coming from a pure FullText search and Information Retrieval background, I find myself in somewhat new territory.

But within that new territory there is a link to my area of expertise: Lucene/Solr in the rdf.cris package. This package turns out to be crucial for our project, and I would gladly contribute my knowledge as a Lucene and Solr developer. So here, in a nutshell, is a list of "small contributions" to start with:

- Abstraction Refactoring
Currently CRIS uses Lucene as its FT engine, but we might eventually want to move to Solr (or Elasticsearch, for XYZ reasons). The first step would be to remove all Lucene dependencies from the rdf.cris package and push the implementation into an rdf.cris.lucene package.

- Lucene 4.x Branch
There have been a large number of changes since the 2.x and 3.x branches of Lucene. I'd propose a small refactor and overhaul of the rdf.cris.lucene package to take advantage of Lucene's newer features (Facets, SearcherManager, …)

- Solr Implementation
With production use in mind, I strongly believe Clerezza's CRIS component should be able to leverage established services without having to manage their scalability itself. That goes most obviously for FullText search. The idea is to be able to use a remote Solr server (Solr since it comes with the whole pseudo-REST service layer on top of Lucene).

- Fine Grained Search
As a logical evolution from the points above, it would then be perfect if Clerezza's fulltext search capabilities could benefit from all the features of Lucene/Solr. I am especially thinking about:
-- Field/Analyzer specialisation (we don't compare authors, dates and text in the same way in Lucene/Solr)
-- Boosting (for IR, the title of a document usually carries more important information than its footnotes)
-- Advanced facets (things like date-range facets and pivot facets (called 2nd level facets in Fusepool))
-- Geolocalised searches (a big thing in the Lucene/Solr 4.x branch… would eventually be a nice-to-have)
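
The abstraction and boosting points above could be sketched roughly as follows. This is only an illustration, not CRIS code: FullTextIndex, InMemoryIndex and BoostDemo are hypothetical names, the in-memory class merely stands in for a future rdf.cris.lucene or Solr backend, and the scoring (the sum of the boosts of the fields that contain the term) is a deliberate simplification rather than Lucene's ranking.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Small demo: a title hit outranks a footnote hit once boosts are applied. */
class BoostDemo {
    public static void main(String[] args) {
        FullTextIndex index = new InMemoryIndex();
        index.add("doc1", Map.of("title", "Annual report", "footnotes", "See the Lucene manual"));
        index.add("doc2", Map.of("title", "Lucene in practice", "footnotes", "None"));
        System.out.println(index.search("lucene", Map.of("title", 3.0, "footnotes", 0.5)));
    }
}

/** Engine-neutral contract (hypothetical); the only search type callers would see. */
interface FullTextIndex {
    /** Stores a document as a map of field name to text. */
    void add(String id, Map<String, String> fields);

    /** Returns the ids of matching documents, highest score first. */
    List<String> search(String term, Map<String, Double> fieldBoosts);
}

/** Toy in-memory backend standing in for a Lucene- or Solr-backed implementation. */
class InMemoryIndex implements FullTextIndex {
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    @Override
    public void add(String id, Map<String, String> fields) {
        docs.put(id, fields);
    }

    @Override
    public List<String> search(String term, Map<String, Double> fieldBoosts) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> doc : docs.entrySet()) {
            double score = 0.0;
            for (Map.Entry<String, String> field : doc.getValue().entrySet()) {
                if (field.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                    // Fields without an explicit boost count with a neutral weight of 1.0.
                    score += fieldBoosts.getOrDefault(field.getKey(), 1.0);
                }
            }
            if (score > 0.0) {
                scores.put(doc.getKey(), score);
            }
        }
        List<String> hits = new ArrayList<>(scores.keySet());
        hits.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed());
        return hits;
    }
}
```

Since callers would only ever hold a FullTextIndex, replacing the toy backend with a Lucene or remote Solr one should not require any change on their side.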

I will carry out this work over the next few weeks/months as part of the Fusepool project, but most of all I would be pleased and interested to finally get a top-notch implementation of a cross RDF-text solution. Very much looking forward to your feedback and hopefully support ;)

PS: whoever initiated the GraphIndexer implementation did an excellent job! Will hopefully follow in his footsteps!

Cheers, 

_Stephane

Re: Search in rdf.cris

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Stephane,

sorry for the late response.

2013/10/3 Stephane Gamard <st...@gamard.net>

> Thank you Tommaso,
>
> I might need help or at the very least simple pointers and debates over
> certain principles and guidelines.
>
> First one being: the choice to either abstract everything related to
> search (such as Sorting fields, query, filters and facets) or to use the
> Lucene native objects. Small overview of pros and cons (for the rdf.cris
> package, not the implementation packages).
>

yes, that's usually one of the biggest challenges when search is not part of
the core architecture. I tend to prefer the more abstract way of doing
things, with an eye on keeping the APIs as generic yet flexible as possible.
At the same time, having a number of use cases and implementation features
that one wants to leverage may be a good driver for designing such APIs.


>
> *Native Lucene*
> + Objects already exist and are well implemented (SortField, Facet, …)
> - Bound to Lucene semantics (fairly easy to use, but certain impl
> providers will have to rewrite using a Lucene translation… in case someone
> wants to make a "Fast" or GSA impl for Clerezza). Note that Lucene, Solr
> and Elastic can fairly easily work with native Lucene objects
> +/- Should put all search-ability logic into helper classes so as not to
> force external packages to talk "Lucene"
>
> *Abstracted Classes*
> - LOT of re-coding of concepts that are straightforward in Lucene
> + No Lucene dependencies and no need for helper classes
> + Not bound to any impl; a rewrite for a possible Solr, GSA, Fast, … will
> not require basic knowledge of Lucene.
>
> I'd be interested in your POV on this. My main goal is for ppl outside of
> the rdf.cris package never to have to learn any specialised API while still
> taking advantage of all the IR features of any search engine.
>
>
I think this last requirement goes in the direction of a more abstract design.
Maybe a good compromise for starting would be sketching an API, implementing
a couple of use cases with Lucene, enhancing the API, and iterating a few
times until we're satisfied with it.

My 2 cents,
Tommaso



Re: Search in rdf.cris

Posted by Stephane Gamard <st...@gamard.net>.

Thank you Tommaso, 

I might need help, or at the very least some pointers and discussion on certain principles and guidelines.

First one being: the choice to either abstract everything related to search (such as sorting fields, queries, filters and facets) or to use the Lucene native objects. Small overview of pros and cons (for the rdf.cris package, not the implementation packages).

Native Lucene
+ Objects already exist and are well implemented (SortField, Facet, …)
- Bound to Lucene semantics (fairly easy to use, but certain impl providers will have to rewrite using a Lucene translation… in case someone wants to make a "Fast" or GSA impl for Clerezza). Note that Lucene, Solr and Elastic can fairly easily work with native Lucene objects
+/- Should put all search-ability logic into helper classes so as not to force external packages to talk "Lucene"

Abstracted Classes
- LOT of re-coding of concepts that are straightforward in Lucene
+ No Lucene dependencies and no need for helper classes
+ Not bound to any impl; a rewrite for a possible Solr, GSA, Fast, … will not require basic knowledge of Lucene.

I'd be interested in your POV on this. My main goal is for ppl outside of the rdf.cris package never to have to learn any specialised API while still taking advantage of all the IR features of any search engine.
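
To make the "Abstracted Classes" option a bit more tangible, here is a minimal sketch of what engine-neutral request objects might look like. All type names (SortSpec, SearchRequest, RequestTranslator, SolrParamTranslator) are hypothetical and nothing here is existing rdf.cris API. The translator renders plain key=value pairs using the standard Solr request parameters q, sort, facet and facet.field; values are not URL-encoded in this sketch, and a Lucene backend would presumably supply its own translator to native objects instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Small demo: build a neutral request and render it for one backend. */
class SearchRequestDemo {
    public static void main(String[] args) {
        SearchRequest request = new SearchRequest("clerezza")
                .sortBy("date", false)
                .facetOn("author");
        System.out.println(new SolrParamTranslator().translate(request));
    }
}

/** Engine-neutral sort criterion (hypothetical type, not a Lucene class). */
final class SortSpec {
    final String field;
    final boolean ascending;

    SortSpec(String field, boolean ascending) {
        this.field = field;
        this.ascending = ascending;
    }
}

/** Engine-neutral search description that callers build without touching Lucene. */
final class SearchRequest {
    final String queryText;
    final List<SortSpec> sorts = new ArrayList<>();
    final List<String> facetFields = new ArrayList<>();

    SearchRequest(String queryText) {
        this.queryText = queryText;
    }

    SearchRequest sortBy(String field, boolean ascending) {
        sorts.add(new SortSpec(field, ascending));
        return this;
    }

    SearchRequest facetOn(String field) {
        facetFields.add(field);
        return this;
    }
}

/** Each backend supplies one translator from the neutral request to its own form. */
interface RequestTranslator<T> {
    T translate(SearchRequest request);
}

/** Renders the request as Solr-style key=value parameters (not URL-encoded here). */
final class SolrParamTranslator implements RequestTranslator<List<String>> {
    @Override
    public List<String> translate(SearchRequest request) {
        List<String> params = new ArrayList<>();
        params.add("q=" + request.queryText);
        if (!request.sorts.isEmpty()) {
            List<String> clauses = new ArrayList<>();
            for (SortSpec sort : request.sorts) {
                clauses.add(sort.field + (sort.ascending ? " asc" : " desc"));
            }
            // Solr accepts several sort clauses separated by commas.
            params.add("sort=" + String.join(",", clauses));
        }
        if (!request.facetFields.isEmpty()) {
            params.add("facet=true");
            for (String field : request.facetFields) {
                params.add("facet.field=" + field);
            }
        }
        return params;
    }
}
```

The cost mentioned under "Abstracted Classes" shows up here too: every concept (sorting, facets, later boosts or ranges) has to be re-modelled once in neutral form and once per translator.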

_Stephane



Re: Search in rdf.cris

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Stephane,

I don't have much time right now, but I just wanted to let you know that IMHO
your list of goals / tasks sounds completely reasonable. In case you need it,
I may be able to give some help over the coming weeks.

Regards,
Tommaso

