
Posted to dev@stanbol.apache.org by Luca Dini <di...@celi.it> on 2012/03/02 11:28:47 UTC

CV Mining Which CMS

Hi Andreas,
that's a good question. So far we have not been using a CMS, just a
Solr-based application with faceting in the style of ajax-solr or VuFind.
Now we would like to move to a CMS to allow real CV management and not
just uploading and searching. The basic criteria for the choice are:
1) Faceted Search
===============
We noticed that users of this kind of application quickly become
comfortable with this search modality. However, it looks like neither
Alfresco nor Nuxeo has anything comparable. Nuxeo has a kind of
faceted navigation, but it is not quite the same thing: for each field
it offers every facet value as selectable, whether or not selecting it
would return any documents. Moreover, it does not show the number of
documents that selecting a given facet would return. And it does not
seem that NEs (named entities) are integrated in this kind of search.
Alfresco seems to support facets only via the LucidImagination
integration, which is proprietary. However, it is based on Solr, so it
might be possible to integrate a view with ajax-solr.
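
As a rough illustration of the behaviour described above (a document count per facet value, and no facet values that lead to empty results), here is a minimal sketch of the Solr request parameters involved. The core URL and the field names `skills` and `city` are assumptions made up for the example; `facet`, `facet.field`, `facet.mincount` and `fq` are standard Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL, chosen only for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/cv/select"


def build_facet_query(text, facet_fields, selected=None):
    """Build a Solr select URL that asks for facet counts on each field.

    facet.mincount=1 hides facet values that would return no documents.
    """
    params = [
        ("q", text),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    # One facet.field parameter per field to be shown in the navigation.
    params += [("facet.field", field) for field in facet_fields]
    # Each facet value the user already selected narrows the results
    # through a filter query (fq).
    for field, value in (selected or {}).items():
        params.append(("fq", '%s:"%s"' % (field, value)))
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Hypothetical field names: "skills" and "city".
url = build_facet_query("java", ["skills", "city"], {"city": "Paris"})
```

The response to such a request carries, for each facet field, the values together with the number of matching documents, which is what a front end like ajax-solr renders.
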
2) Stanbol Integration
===================
The interesting claim that we would like to verify is that using
Stanbol makes the integration between a semantic system and a CMS
much easier. Thus we need to evaluate how far the existing
Stanbol/CMS integrations can support a central use of the data
coming from Stanbol, and not just an accessory enrichment.
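
To make the integration point concrete, here is a minimal sketch of the HTTP request a CMS could send to a Stanbol enhancer to get RDF enhancements for the text of a CV. It only builds the request and does not send it. The server URL and the sample sentence are assumptions for the example, and the endpoint path may differ between Stanbol releases.

```python
from urllib.request import Request

# Assumed location of a locally running Stanbol enhancer; the path may
# differ depending on the Stanbol release and deployment.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancement_request(cv_text, url=STANBOL_ENHANCER_URL):
    """Build (but do not send) a POST request carrying plain CV text.

    The Accept header asks for the enhancements serialized as Turtle.
    """
    return Request(
        url,
        data=cv_text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "text/turtle",
        },
        method="POST",
    )


# Made-up sample sentence standing in for text extracted from a French CV.
req = build_enhancement_request("Ingénieur Java à Paris depuis 2009.")
```

Sending the request with `urllib.request.urlopen(req)` would return the RDF that the CMS then has to store and index, which is the part that decides whether the Stanbol data is central or merely accessory.
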

3) Non functional constraints
=========================
Even in the best case, I think it would be over-optimistic to assume
that no intervention on the source code of the CMS will be required. As
we are mainly Java programmers, a CMS implemented in Java is a
prerequisite.

Over the last week we have been investigating all these directions, and
so far it seems that Alfresco with a specially designed interface is
probably a wise choice. But we haven't made a choice yet, so any
advice would be very valuable.
Many thanks,
Luca


On 01/03/2012 20:52, Andreas Kuckartz wrote:
> Hi Luca,
>
> which CMS do you intend to use for the project?
>
> Cheers,
> Andreas
> ---
>
> On 01.03.2012 15:44, Luca Dini wrote:
>> Dear All,
>> Please let me introduce a new early adopter project in which we will
>> be involved. I look forward to a great and intellectually inspiring
>> exchange with you all.
>> Kind regards,
>> Luca
>>
>> The project (run by CELI under the umbrella of the IKS early adopter
>> program) aims to integrate Stanbol technology with a specific context
>> of use, i.e. CV management via CMS and semantic technologies. The
>> crucial challenge of this integration is the parametrization of
>> Stanbol to deal with information which has been automatically
>> extracted from CVs. Besides the direct integration results, which will
>> be distributed under the same conditions as the Stanbol software, the
>> early adoption project will produce two additional by-products:
>>
>>      The provision to Stanbol of classes allowing the connection with
>> Linguagrid (www.linguagrid.org) and possibly LanguageGrid
>> (http://langrid.org/en/index.html).
>>      The verification of the extensibility of Stanbol to languages
>> other than English (The project will concern CVs written in French).
>>
>> We envisage two prototypical use cases, which are described in the
>> following:
>> Use-Case 1: Human Resources Department
>>
>> The context is that of the Human Resources department of a big company
>> or of any recruitment company. The basic goal is to provide them with an
>> open source document management system able to deal in an intelligent
>> way with unstructured CVs (or "resumes"), i.e. CVs which come in
>> Microsoft Word, PDF, Open Office etc. Each time a new CV arrives it is
>> inserted in the document base. Behind the scenes this is not just
>> adding a document but passing it to a Stanbol server which enhances
>> it with structured information.
>>
>> This might represent:
>>
>>      experiences of the candidate
>>      skills of the candidate
>>      Education level
>>      reference data (name, address etc.)
>>      contact data
>>
>> Some of these data might be slightly more structured than just named
>> entities, but definitely within the representational power of RDF. Some
>> of them could be even more semantically enriched, by providing external
>> information on companies, places, specific technologies etc.
>>
>> As a result, personnel at the HR department would be able to
>> formulate queries such as (just as examples):
>>
>>      All CVs of people living in Paris older than 27 years
>>      All CVs of people with skills in SQL Server and Java
>>      All people who have worked in a high-tech company since November
>> 2011.
>>
>> ....
>>
>> In terms of GUI the user will be confronted with a system that allows
>> easy search and easy population of CV data.
>>
>>
>> Use-Case 2: Employment Administration
>>
>> In the second use case we take into account the needs of public
>> agencies whose institutional role is to reintegrate into the labor
>> market people who have lost their job or who are looking for their
>> first job. In particular we are considering institutions such as the
>> French Pôle emploi (http://www.pole-emploi.fr/accueil/ ,
>> http://fr.wikipedia.org/wiki/P%C3%B4le_emploi). This institution is in
>> charge of matching supply and demand on the labor market, in
>> particular by directing candidates to the right potential employer,
>> suggesting possible training, helping to shape their skills,
>> etc. In many cases these agencies are managed at a local rather than a
>> national level, as the labor market is affected by regional
>> constraints. In this use case the parametrized CMS has a double goal:
>>
>>      Much like in the previous case to allow the fast and intelligent
>> retrieval of CVs out of the document base in order to answer potential
>> employer needs.
>>      To be able to perform Business Intelligence like tasks over the
>> structured information provided by the mass of analyzed CVs. Of course
>> performing BI analysis is out of the scope of this proposal, but the
>> structuring of CV information into ontology based classes is
>> definitely the first step towards this direction.
>>
>>
>>
>>
>> Challenges
>>
>> From a technical point of view the most interesting challenge consists
>> in integrating the set of Stanbol enhancers with the semantic web
>> services provided at www.linguagrid.org. In principle it should not be
>> a different integration from what has already been done with
>> OpenCalais WS and Zemanta WS. However there are at least three major
>> challenges:
>>
>>      Multilinguality. The extraction will consider French documents
>> rather than English ones. Moreover, in a second phase (not covered by
>> the present project), the whole system could be extended to Italian and
>> French.
>>      Ontological extension. While CVs typically contain quite a lot of
>> named entities which are already covered by Stanbol (e.g. geographical
>> names, time expressions, company names, person names) there are
>> entities which will need some ontology extension, such as skills and
>> education.
>>      Structural Complexity. In a CV, instances of entities are linked
>> to each other in a structurally complex way. For instance places are not
>> just a flat list of geographical entities, but they are likely to be
>> connected with periods, with job types, with companies, etc. Handling
>> this structural complexity represents an important challenge.
>>
>>
>>
>

Re: CV Mining Which CMS

Posted by Olivier Grisel <ol...@ensta.org>.

2012/3/2 Luca Dini <di...@celi.it>:
> Hi Andreas,
> that's a good question. So far we have been using no CMS. Just a SOLR based
> application with faceting in the style of ajax-solr or vuFind. Now we would
> like to shift to a CMS to allow real CV management and not just uploading
> and searching. So the basic elements for the choice are:
> 1) Faceted Search
> ===============
> We noticed that users of this kind of application become rapidly confident
> in this kind of search modality. However it looks like neither Alfresco nor
> Nuxeo have something comparable. Nuxeo has a kind of faceted navigation
> implementation, but it is not really that, in the sense that it provides for
> each field all facets to be selectable, irrespective of the fact that it
> returns documents or not. Moreover it does not provide the number of
> documents which would be returned by selecting a certain facet.

Yes, until recently we could not integrate with Solr because it
would have been impossible to implement ACL inheritance efficiently in
a scalable way. However there is a new JoinQueryParserPlugin in Solr
trunk (to be released in 4.0) and we have started investigating Solr
integration. This is scheduled for Nuxeo 6.0:

https://github.com/nuxeo/nuxeo-solr/tree/master/architecture
https://jira.nuxeo.com/secure/IssueNavigator.jspa?reset=true&jqlQuery=cf[10080]+%3D+solr
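
A minimal sketch of what an ACL filter could look like with the join query parser; this is an illustration, not the actual Nuxeo design. The schema is hypothetical (separate ACL documents with `acl_id` and `principals` fields, and content documents pointing to them via `acl_ref`); only the `{!join from=... to=...}` local-params syntax comes from Solr.

```python
def acl_join_filter(principals, from_field="acl_id", to_field="acl_ref",
                    principal_field="principals"):
    """Build a Solr filter query restricting results to readable documents.

    The inner query selects the ACL documents that grant access to any of
    the given principals; the join then keeps the content documents whose
    to_field matches the from_field of one of those ACL documents.
    All field names are hypothetical.
    """
    # Quote each principal as a phrase and escape embedded double quotes.
    quoted = " OR ".join('"%s"' % p.replace('"', '\\"') for p in principals)
    return "{!join from=%s to=%s}%s:(%s)" % (
        from_field, to_field, principal_field, quoted)


# Made-up user and group names.
fq = acl_join_filter(["alice", "hr-team"])
```

Passed as an `fq` parameter, such a filter lets many documents share one ACL document, so an inherited permission change only touches the ACL document rather than every descendant.
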

> And it does not seem that NEs are integrated in this kind of search.

This is my objective for the end of the IKS year.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel