Posted to dev@manifoldcf.apache.org by Rafa Haro <rh...@apache.org> on 2014/07/01 19:04:51 UTC

Configuration Management at Transformation Connectors

Hi guys,

I'm trying to develop my first Transformation Connector. Before starting 
to code, I first tried to read enough documentation, and I have also 
studied the Tika extractor as a transformation connector example. 
Currently, I'm just trying to implement an initial version of my 
connector, starting with something simple so I can complicate things a 
little bit later. The first problem I'm facing is configuration 
management, where I'm probably missing something. In my case, I need a 
fixed configuration when creating an instance of the connector and an 
extended configuration per job. Let's say that the connector 
configuration has to set up a service, and the job configuration will 
define how the service should work for each job. With both 
configurations, I need to create an object which is expensive to 
instantiate. Here is where the doubts arise:

1. I would like to initialize the configuration object only once per job 
execution. Because the configuration is not supposed to change during a 
job execution, I would like to be able to take the configuration 
parameters from the ConfigParams and Specification objects and create a 
single instance of my configuration object.

2. The getPipelineDescription method is quite confusing for me. In the 
Tika Extractor, this method is used to pack the configuration of the 
Tika processor into a string. Then this string is unpacked again in the 
addOrReplaceDocumentWithException method to read the configuration. My 
question is: why? As far as I understand, the configuration can't change 
during the job execution, and according to the documentation "the 
contents of the document cannot be considered by this method, and that a 
different version string (defined in IRepositoryConnector) is used to 
describe the version of the actual document". So, if only configuration 
data can be used to create the output version string, this version 
string could probably be checked by the system once before starting the 
job rather than produced and checked per document, because basically all 
the documents are going to produce exactly the same output version 
string. I'm probably missing something but, looking at the Tika 
Transformation connector for example, it seems pretty clear that there 
would be no difference between the output version strings of all the 
documents, because it uses only configuration data to create the string.

3. In the addOrReplaceDocumentWithException method, why is the 
pipelineDescription passed as a parameter instead of the connector 
Specification, to make it easier for the developer to access the 
configuration without marshalling and unmarshalling it?

4. Is there a way to reuse a single configuration object per job 
execution? In my Output connector, I used to initialize my custom stuff 
in the connect method (I'm not sure if this strategy is valid anyway), 
but for Transformation connectors I'm not even sure this method is 
called.

Thanks a lot in advance for your help. Please note that these questions 
are of course not intended as criticism. This mail is just a dump of 
doubts that will probably help me to better understand the workflows in 
ManifoldCF.

Re: Configuration Management at Transformation Connectors

Posted by Rafa Haro <rh...@apache.org>.
Hi Karl,

El 02/07/14 13:17, Karl Wright escribió:
> Hi Rafa,
>
> Without a concrete example, you won't get a concrete recommendation.  But
> in general please understand that the architecture is what it is, for very
> good reasons, and if you want your connector to work within it, you need to
> play by the rules.

Of course, I just wanted to be sure about the options. I really 
appreciate your help and your clarifications, and I hope that the use 
cases of the developers integrating ManifoldCF can also help the 
community to improve it by addressing (or not :->) the proposed issues. 
From my side, I'm going to further investigate how the framework works 
at the code level to see if I can extend the Transformation connector 
workflow to allow resource initialization from the connector 
Configuration and the Output Specification.

Thanks,
Rafa

Re: Configuration Management at Transformation Connectors

Posted by Karl Wright <da...@gmail.com>.
Hi Rafa,

Without a concrete example, you won't get a concrete recommendation.  But
in general please understand that the architecture is what it is, for very
good reasons, and if you want your connector to work within it, you need to
play by the rules.

Thanks,
Karl




Re: Configuration Management at Transformation Connectors

Posted by Rafa Haro <rh...@apache.org>.
Hi Karl,

El 02/07/14 12:37, Karl Wright escribió:
> Hi Rafa,
>
> If there are a limited number of read-only models, then you can build a
> static hashmap between model data and the file name, and populate that on
> demand.
It is a possibility, although the models stuff was just an example. The 
generic question is: how can resources be reused in a Transformation 
Connector? Or, what is the proper place in the implementation to 
initialize a resource that has to be reused while a job is running?

I'm still missing this because I still don't know where to access the 
Specification object apart from the getPipelineDescription method. So, 
if there is no other point in the architecture, I'm afraid I have to 
check whether the initialization has been done and even synchronize it 
myself.
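
Something like the following is what I'm afraid I would end up writing 
by hand (just a rough sketch; all the names are made up):

  // Per-instance lazy initialization, synchronized by hand because pooled
  // connector instances can be used by several worker threads.
  public class LazyResourceHolder
  {
    private Object resource;        // the expensive object
    private String initializedSpec; // packed spec the resource was built from

    public synchronized Object getResource(String packedSpec)
    {
      // Rebuild only when uninitialized or when the job specification changed.
      if (resource == null || !packedSpec.equals(initializedSpec))
      {
        resource = loadExpensiveResource(packedSpec);
        initializedSpec = packedSpec;
      }
      return resource;
    }

    private Object loadExpensiveResource(String packedSpec)
    {
      // Placeholder for the costly construction (e.g. loading a binary model).
      return new Object();
    }
  }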

Thanks for your help so far Karl


Re: Configuration Management at Transformation Connectors

Posted by Karl Wright <da...@gmail.com>.
Hi Rafa,

If there are a limited number of read-only models, then you can build a
static hashmap between model data and the file name, and populate that on
demand.  I would write a class (ModelRepository or some such) that
abstracts the hashmap and the loading process.  OR, better yet, you can
include the models as resources in your connector jar.  (It's better
because it's limited only to what you supply in advance, and so the memory
constraints are known in advance and cannot grow.)
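
A minimal sketch of what I mean (the model type and the loading code are
placeholders):

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  // Static, on-demand cache from model file name to the loaded model.
  public class ModelRepository
  {
    private static final ConcurrentMap<String,Object> models =
      new ConcurrentHashMap<String,Object>();

    private ModelRepository()
    {
    }

    // Loads each model at most once per JVM; later calls reuse the cached copy.
    public static Object getModel(String fileName)
    {
      return models.computeIfAbsent(fileName, ModelRepository::load);
    }

    private static Object load(String fileName)
    {
      // Placeholder: deserialize the binary model from disk, or from a
      // resource bundled inside the connector jar.
      return new Object();
    }
  }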

Thanks,
Karl



Re: Configuration Management at Transformation Connectors

Posted by Rafa Haro <rh...@apache.org>.
Hi Karl,

El 02/07/14 12:13, Karl Wright escribió:
> Hi Rafa,
>
> First, some more basics.
>
> - Configuration information is about "how" documents are indexed
> - Specification information is about "what" documents or metadata is indexed
>
> Configuration is specified as part of the connection definition;
> specification is edited as part of the job.  Not all connectors have
> configuration information; Tika doesn't, for instance.  Not all connectors
> have specification information either.
Good. This is how I already understood it.
>
> The pooling is specific to Configuration information.  Connections which
> have the same Configuration can be used in multiple jobs and have different
> Specification information.  That is why you really should not be creating a
> design where there's a large cost penalty in interpreting your connector's
> Specification information; it's meant to be relatively light (and so far it
> has been, in all connectors I've ever heard of.)  Furthermore, and this is
>> really really key, the specification information MUST be digestible to a
> simple string, because that is how ManifoldCF determines whether changes
> have occurred or not from one run to another.  This, too, must be easy and
> fast to do, or your connector will be quite slow.
The specification information can be very easily digestible, but in some 
cases it can be very difficult to manage. Let me give an example: Named 
Entity Recognition models are binary models that you need to load in 
memory in order to use them for extracting entities from documents' 
content. In terms of configuration, the model is simply a path to a 
file, so a single string. Also, I could of course have different models 
for different jobs. But I want to apply the same model to all the 
documents processed in a single job. So, I just need to load the model 
in memory only once for each job execution. And my real problem right 
now is that I can't find any proper place in the design of the 
Transformation Connector to do such a unique initialization, because the 
path to the model is hosted in the Specification object, but that object 
is only passed to the getPipelineDescription method, which is called for 
every document processed. So, I can probably use the idiom I included in 
the last email (check if the model object is null every time in that 
method), but I was wondering if there is a more appropriate place to do it.

Thanks,
Rafa


Re: Configuration Management at Transformation Connectors

Posted by Karl Wright <da...@gmail.com>.
Hi Rafa,

First, some more basics.

- Configuration information is about "how" documents are indexed
- Specification information is about "what" documents or metadata is indexed

Configuration is specified as part of the connection definition;
specification is edited as part of the job.  Not all connectors have
configuration information; Tika doesn't, for instance.  Not all connectors
have specification information either.

The pooling is specific to Configuration information.  Connections which
have the same Configuration can be used in multiple jobs and have different
Specification information.  That is why you really should not be creating a
design where there's a large cost penalty in interpreting your connector's
Specification information; it's meant to be relatively light (and so far it
has been, in all connectors I've ever heard of.)  Furthermore, and this is
really really key, the specification information MUST be digestible to a
simple string, because that is how ManifoldCF determines whether changes
have occurred or not from one run to another.  This, too, must be easy and
fast to do, or your connector will be quite slow.
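
For illustration, a rough sketch of what digesting a specification to a
string might look like (the field names are invented; the Tika extractor
does something similar with a small packer class).  The essential
property is determinism: the same settings must always yield the same
string.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  public class SpecDigest
  {
    // Produce a canonical, unambiguous string from the job's settings.
    public static String pack(List<String> fieldMappings, boolean keepAllMetadata)
    {
      List<String> sorted = new ArrayList<String>(fieldMappings);
      Collections.sort(sorted);  // order-independence keeps the digest stable
      StringBuilder sb = new StringBuilder();
      sb.append(sorted.size()).append(':');
      for (String m : sorted)
      {
        // Length-prefix each item so concatenation can never be ambiguous.
        sb.append(m.length()).append(':').append(m);
      }
      sb.append(keepAllMetadata ? '+' : '-');
      return sb.toString();
    }
  }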

Thanks,
Karl




Re: Configuration Management at Transformation Connectors

Posted by Rafa Haro <rh...@apache.org>.
Hi Karl,

First of all, thanks for your answers. I will read the proposed chapters 
but, please, find inline further questions:

El 01/07/14 19:21, Karl Wright escribió:
> Hi Rafa,
>
> Let me answer one question at a time.
>
> bq. I would like to initialize the configuration object only once per job
> execution. Because the configuration is not supposed to be changed during a
> job execution, I would like to be able to take the configuration parameters
> from ConfigParams and from Specification objects and create a unique
> instance of my configuration object.
>
> Connection instances are all pooled and reused.  You need to read about
> their lifetime.  ManifoldCF in Action chapter 6 (IIRC) is where you will
> find this: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
> You should also be aware that there is *no* prohibition on configuration or
> specification changing during a job run; the framework is structured,
> however, so that you don't need to worry about this when writing your
> connector.
I understand this, Karl. And precisely because of the pooling, it is 
hard for me to believe that, during a job execution, the system is able 
to stop all the threads and freeze the execution in order to 
reinitialize all the connector instances in the pool if the user changes 
the configuration. If this does not actually happen, then in 
implementations like the current Tika Extractor, for example, the 
getPipelineDescription method will always return exactly the same output 
version for all the crawled documents in the current job. I understand 
the need to check the output version from job to job, but not per 
document within a single job.

Also, there is something I'm completely missing: what is the output 
specification, and how does it differ from the connector and job 
configurations?
>
>
> bq. The getPipelineDescription method is quite confusing for me...
>
> Getting a version string and indexing a document may well be separated in
> time, and since it is possible for things to change in-between, the version
> string should be the basis of decisions your connector is making about how
> to do things.  The version string is what gets actually stored in the DB,
> so any differences will be picked up on later crawls.
>
> FWIW, the IRepositoryConnector interface predates the decision to not
> include a document specification for every method call, and that has
> persisted for backwards compatibility reasons, although in MCF 2.0 that may
> change.  The current design enforces proper connector coding.
>
> bq. In the addOrReplaceDocumentWithException, why is the
> pipelineDescription passed by parameter instead of the
> connector Specification...?
>
> See answer above.
>
>
> bq. Is there a way to reuse a single configuration object per job
> execution? In the Output processor connector, I used to initialize my
> custom stuff in the connect method (I'm not sure if this strategy is valid
> anyway), but for the Transformation connectors I'm not even sure if this
> method is called.
>
> You really aren't supposed to have a *single* object, but rather one per
> connection instance.  Connection instances are long-lived, remember.  That
> object should also expire eventually if there is no use.  There's a
> particular design pattern you should try to adhere to, which is to have a
> getSession() method that sets up your long-lived member object, and have
> the poll() method free it after a certain amount of inactivity.  Pretty
> much all connectors these days use this pattern; for a modern
> implementation, have a look at the Jira connector.
Yes yes, of course. There would be a configuration object bound to each 
connector instance, of course. The problem I'm facing is that I want to 
create this object only once (it could perfectly well be a member of the 
connector), and I can't find a proper way/place to do it, because I need 
both the configuration of the connector (ConfigParams, which is always 
available, so that is fine) and the Specification object (which seems to 
contain the job configuration data), which as far as I know is only 
passed to the getPipelineDescription method. I would prefer not to do 
the initialization in that method, because it is called for each 
processed document, and I would like to avoid the typical hack like 
"if(customConfig == null) customConfig = new CustomConfig(params, 
specification);"
>
>
> FWIW, there's no MCF in Action chapter on transformation connectors yet,
> but they are quite similar to output connectors in many respects, so
> reading Chapter 9 may help a bit.
>
> Thanks,
> Karl
Thanks to you Karl,

Cheers,
Rafa


Re: Configuration Management at Transformation Connectors

Posted by Karl Wright <da...@gmail.com>.
I should also clarify that another reason that the current
pipelineDescription string design is important is because otherwise there
would be a possibility of something which affects indexing or
transformation *not* getting properly included in the version string.  Such
an omission would, of course, break incremental indexing, because there
would be no way to detect that a pertinent change had taken place.
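
Schematically (this is not framework code, just an illustration of the
contract):

  public class IncrementalCheck
  {
    // A document is re-processed only when its version string differs from
    // the one stored in the database by the previous crawl.
    public static boolean needsReindex(String storedVersion, String currentVersion)
    {
      return storedVersion == null || !storedVersion.equals(currentVersion);
    }

    public static void main(String[] args)
    {
      // A pertinent setting changed between crawls...
      String stored = "lowercase=true";
      String current = "lowercase=false";
      // ...so the document is reprocessed.  Had that setting been omitted
      // from the version string, the change would be invisible and the
      // document would wrongly be skipped.
      System.out.println(needsReindex(stored, current));  // prints: true
    }
  }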

Thanks,
Karl



Re: Configuration Management at Transformation Connectors

Posted by Karl Wright <da...@gmail.com>.
Hi Rafa,

Let me answer one question at a time.

bq. I would like to initialize the configuration object only once per job
execution. Because the configuration is not supposed to be changed during a
job execution, I would like to be able to take the configuration parameters
from ConfigParams and from Specification objects and create a unique
instance of my configuration object.

Connection instances are all pooled and reused.  You need to read about
their lifetime.  ManifoldCF in Action chapter 6 (IIRC) is where you will
find this: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
You should also be aware that there is *no* prohibition on configuration or
specification changing during a job run; the framework is structured,
however, so that you don't need to worry about this when writing your
connector.


bq. The getPipelineDescription method is quite confusing for me...

Getting a version string and indexing a document may well be separated in
time, and since it is possible for things to change in-between, the version
string should be the basis of decisions your connector is making about how
to do things.  The version string is what gets actually stored in the DB,
so any differences will be picked up on later crawls.

FWIW, the IRepositoryConnector interface predates the decision to not
include a document specification for every method call, and that has
persisted for backwards compatibility reasons, although in MCF 2.0 that may
change.  The current design enforces proper connector coding.

bq. In the addOrReplaceDocumentWithException, why is the
pipelineDescription passed by parameter instead of the
connector Specification...?

See answer above.
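
To make that concrete: the connector packs everything it needs into the
version string once, and later unpacks that same string inside
addOrReplaceDocumentWithException instead of consulting the
Specification again.  A rough sketch of such an unpacker, using an
invented "count:len:item..." format:

  import java.util.ArrayList;
  import java.util.List;

  public class SpecUnpack
  {
    // Reverses a packed string of the form "count:len:item...len:item".
    public static List<String> unpack(String packed)
    {
      List<String> items = new ArrayList<String>();
      int firstColon = packed.indexOf(':');
      int count = Integer.parseInt(packed.substring(0, firstColon));
      int pos = firstColon + 1;
      for (int k = 0; k < count; k++)
      {
        int colon = packed.indexOf(':', pos);
        int len = Integer.parseInt(packed.substring(pos, colon));
        items.add(packed.substring(colon + 1, colon + 1 + len));
        pos = colon + 1 + len;
      }
      return items;
    }
  }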


bq. Is there a way to reuse a single configuration object per job
execution? In the Output processor connector, I used to initialize my
custom stuff in the connect method (I'm not sure if this strategy is valid
anyway), but for the Transformation connectors I'm not even sure if this
method is called.

You really aren't supposed to have a *single* object, but rather one per
connection instance.  Connection instances are long-lived, remember.  That
object should also expire eventually if there is no use.  There's a
particular design pattern you should try to adhere to, which is to have a
getSession() method that sets up your long-lived member object, and have
the poll() method free it after a certain amount of inactivity.  Pretty
much all connectors these days use this pattern; for a modern
implementation, have a look at the Jira connector.
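
In outline, the pattern looks like this (the type names and the timeout
are placeholders; see the Jira connector for the real thing):

  public class SessionPatternSketch
  {
    // Release the expensive member object after 5 minutes of inactivity.
    private static final long SESSION_EXPIRATION_MS = 300000L;

    private Object session = null;       // the long-lived, expensive object
    private long lastSessionFetch = -1L;

    // Call this at the top of every method that needs the session.
    protected Object getSession()
    {
      if (session == null)
      {
        session = new Object();  // placeholder for the expensive setup
      }
      lastSessionFetch = System.currentTimeMillis();
      return session;
    }

    // The framework calls poll() periodically; drop the session when idle.
    public void poll()
    {
      if (session != null &&
        System.currentTimeMillis() - lastSessionFetch > SESSION_EXPIRATION_MS)
      {
        session = null;  // allow the expensive object to be garbage collected
      }
    }
  }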


FWIW, there's no MCF in Action chapter on transformation connectors yet,
but they are quite similar to output connectors in many respects, so
reading Chapter 9 may help a bit.

Thanks,
Karl



