Posted to dev@manifoldcf.apache.org by Cédric Ulmer <ce...@francelabs.com> on 2016/11/28 14:41:15 UTC
RE: Architecture options for truncating large documents

Hi all,

Does anyone have an opinion on this? As a reminder, here is our architecture question:

We are currently looking at the possibility of truncating large objects before indexing them, at the MCF level. We face an architecture dilemma and would welcome the wisdom of the community:

*         What we want to achieve: Whenever a document is too large, instead of dropping it completely, we want to be able to index its metadata.

*         How we can achieve that: 
Option 1: We create a transformation connector that empties the stream and keeps only the metadata. Pros: we don’t modify the code of MCF. Cons: every time we install MCF somewhere, although we can script the reload of the transformation connector, we need to manually reconfigure the job bindings, as there is no way to automatically bind transformation connectors to jobs.

Option 2: We modify the standard behavior of the original connector (say, the file connector). Instead of offering the option to drop a document if it’s larger than size X, we modify it so that it offers to drop the content but keep the metadata if the document is larger than size X. Pros: it is in the MCF code once and for all, and thus available whenever we install a new MCF somewhere. Cons: it may not be in line with the spirit of transformation connectors, and it has to be done for every original connector that we are targeting.
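
To make the rule shared by both options concrete, here is a minimal Python sketch. It is not ManifoldCF code: the Document class, its fields, and strip_content_if_too_large are hypothetical names chosen for illustration, and it only shows the decision "larger than size X: empty the content, keep the metadata".

```python
import io
from dataclasses import dataclass, field
from typing import BinaryIO, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a crawled document (not an MCF class)."""
    size_bytes: int
    content: BinaryIO
    metadata: Dict[str, List[str]] = field(default_factory=dict)


def strip_content_if_too_large(doc: Document, max_bytes: int) -> Document:
    """Return doc unchanged if it fits, else a metadata-only copy."""
    if doc.size_bytes <= max_bytes:
        return doc
    # Too large: replace the content with an empty stream, keep the metadata.
    return Document(
        size_bytes=0,
        content=io.BytesIO(b""),
        metadata=dict(doc.metadata),
    )
```

In option 1 this check would live in the transformation connector; in option 2 it would live in each original connector, in place of the existing "drop if larger than X" check.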

Regards,

Cédric Ulmer
President
France Labs – The Search experts
Winner of the EY Internal Search challenge at Viva Technologies 2016
www.francelabs.com
Tel : +33 (0) 662576490

-----Original Message-----
From: Cédric Ulmer [mailto:cedric.ulmer@francelabs.com] 
Sent: Tuesday, 25 October 2016 22:37
To: dev@manifoldcf.apache.org
Subject: RE: Architecture options for truncating large documents

Hi Manifoldians,

No more comments on my question? I'd be really interested in your architecture opinions, especially if we intend to contribute back what we do!

Regards,

Cedric

-----Original Message-----
From: Cédric Ulmer [mailto:cedric.ulmer@francelabs.com]
Sent: Friday, 14 October 2016 17:49
To: dev@manifoldcf.apache.org
Subject: RE: Architecture options for truncating large documents

Hi Muhammed,

Thanks for your contribution. Let me clarify, because I stated it incorrectly in my first email. We can indeed use a script to load the transformation connector, but the binding to the jobs needs to be done manually. And since it's a functionality that we consider relevant for almost all customers (they definitely prefer having at least the metadata rather than nothing at all), we would still have to do the binding manually for almost all jobs.

Regards,

Cedric

-----Original Message-----
From: Muhammed Olgun [mailto:mh.olgun@gmail.com]
Sent: Friday, 14 October 2016 17:37
To: dev@manifoldcf.apache.org
Subject: Re: Architecture options for truncating large documents

Hi Cedric,

I would choose option 1 and create a bash or Python script to automatically reconfigure MCF for that connector. We could even make that script open source so that everyone can easily add their custom connectors.

Thanks!
Muhammed
On Fri, 14 Oct 2016 at 18:02, Cédric Ulmer <cedric.ulmer@francelabs.com> wrote:

> Hi all,
>
>
>
> We are currently looking at the possibility of truncating large objects
> before indexing them, at the MCF level. We face an architecture dilemma
> and would welcome the wisdom of the community:
>
>
>
> *         What we want to achieve: Whenever a document is too large,
> instead of dropping it completely, we want to be able to index its
> metadata.
>
>
>
> *         How we can achieve that:
>
> Option 1: We create a transformation connector that empties the
> stream and keeps only the metadata. Pros: we don’t modify the code of
> MCF. Cons: every time we install MCF somewhere, we need to manually
> reconfigure the transformation connector, as there is no way to
> automatically import transformation connectors.
>
>
>
> Option 2: We modify the standard behavior of the original connector
> (say, the file connector). Instead of offering the option to drop a
> document if it’s larger than size X, we modify it so that it offers
> to drop the content but keep the metadata if the document is larger
> than size X. Pros: it is in the MCF code once and for all, and thus
> available whenever we install a new MCF somewhere. Cons: it may not be
> in line with the spirit of transformation connectors, and it has to be
> done for every original connector that we are targeting.
>
>
>
> Can you share your thoughts on that?
>
>
>
> Regards,
>
>
>
> Cedric
>
>
>
> President
>
> France Labs – The Search experts
>
> Winner of the EY Internal Search challenge at
> <http://www.vivatechnologyparis.com/> Viva Technologies 2016
>
>  <http://www.francelabs.com/> www.francelabs.com
>
>