Posted to dev@manifoldcf.apache.org by Rafa Haro <rh...@apache.org> on 2014/06/30 12:48:47 UTC

Testing Pipelines. Conclusions so far and Some Doubts

Hi,

I have spent a couple of hours testing the Pipelines in ManifoldCF 1.7. 
Before describing the problems I have run into and asking some 
questions, I would like to explain the kind of tests I have performed 
so far:

1. Testing with the File System repository connector, for simplicity

2. Using 2 instances of the Solr Output Connector to test multiple 
outputs. The final Solr instance is the same, and each output connector 
has been configured with a different Solr core (collection1 and 
collection2)

3. Using Allowed Documents and Tika Extractor as transformation 
connectors. Allowed Documents has been configured to allow only PDF 
files (mime type + extension)

4. The processing pipeline I wanted to configure is quite simple: 
filter and extract content (with Tika) for collection1, and plain 
crawling for collection2. Let me explain better: both transformation 
connectors were configured for the collection1 Solr output, and no 
transformation connectors were configured for collection2. I have two 
files in the configured repository path for the File System connector: 
a PDF file and an ODS file. I was expecting only the PDF file to be 
indexed in collection1 and both files in collection2 (see the sketch 
just below).
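
Schematically:

   File System --+--> Allowed Documents --> Tika Extractor --> Solr output #1 (collection1)
                 |
                 +------------------------------------------> Solr output #2 (collection2)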

The result of the experiment has been the following:

1. All the files have been indexed in both collections. Apparently the 
Allowed Documents transformation connector doesn't work with the File 
System repository connector.

2. For the collection1 output connector, I first changed the update 
handler from /update/extract to /update, because Tika Extractor was 
going to be configured for it. This change produced an error in Solr 
while indexing (Unsupported ContentType: application/octet-stream Not 
in: [application/xml, text/csv, text/json, application/csv, 
application/javabin, text/xml, application/json]); it is reproduced in 
the sketch after this list.

3. Therefore, I configured the update handler as /update/extract again. 
Because the exact same content is being indexed in both cores, I have 
no way to know whether the Tika transformation connector is working 
properly or not.
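
For what it's worth, the /update failure looks like expected Solr 
behavior rather than a ManifoldCF bug: the plain /update handler only 
accepts the structured formats listed in the error message, while 
/update/extract (Solr Cell) runs Tika on the Solr side and accepts 
arbitrary binary, which would also explain why both cores end up with 
identical content. Here is a minimal sketch, outside ManifoldCF, that 
reproduces both behaviors and compares the stored documents; the Solr 
URL, test.pdf, and the id/content field names are all assumptions, and 
the content field in particular depends on the /update/extract field 
mappings in solrconfig.xml:

   import requests

   SOLR = "http://localhost:8983/solr"   # assumed local Solr URL
   raw = open("test.pdf", "rb").read()   # placeholder test document

   # The plain /update handler only accepts structured formats, so raw
   # binary reproduces the "Unsupported ContentType" error from point 2:
   r = requests.post(SOLR + "/collection1/update", data=raw,
                     headers={"Content-Type": "application/octet-stream"})
   print(r.status_code)                  # 4xx: Unsupported ContentType

   # /update/extract (Solr Cell) runs Tika server-side, so it accepts the
   # same binary for both cores -- which is why they end up identical:
   for core in ("collection1", "collection2"):
       r = requests.post(SOLR + "/" + core + "/update/extract",
                         params={"literal.id": "test.pdf", "commit": "true"},
                         files={"file": ("test.pdf", raw, "application/pdf")})
       print(core, r.status_code)        # 200 if extraction succeeded

   # Comparing the stored documents then shows no observable difference,
   # which is what makes point 3 impossible to diagnose from outside:
   def stored(core):
       r = requests.get(SOLR + "/" + core + "/select",
                        params={"q": 'id:"test.pdf"', "wt": "json",
                                "fl": "id,content"})
       docs = r.json()["response"]["docs"]
       return docs[0] if docs else None

   print(stored("collection1") == stored("collection2"))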

So much for the testing outcomes. Now I would like to share some 
conclusions from the point of view of our use case. Although the 
pipeline approach is great, as far as I have understood it we still 
can't use it for our purposes. Specifically, what we would like is a 
way to create new repository documents at any point in the chain and 
send them to different output connectors. Let me give a simple use case:

We want to process the documents to extract named entities: persons, 
places and organizations. The first transformation of the pipeline can 
use any NER system to extract the named entities. Then I want to have 
separate repositories (outputs): one for the raw content and one for 
each type of entity, let's say 4 different Solr cores. Of course, with 
the current approach I could send the same repository document to all 
the outputs and filter in each one, but that doesn't sound like a good 
solution to me.
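
To make that fan-out concrete, here is a rough standalone sketch of the 
routing I have in mind, written as plain Python against Solr rather 
than as a real ManifoldCF connector; the NER step, the core names and 
the field names are all placeholders:

   import requests

   SOLR = "http://localhost:8983/solr"   # assumed Solr base URL

   def extract_entities(text):
       # Placeholder NER step; any real NER system would go here
       return {"persons": ["John Doe"], "places": ["Madrid"],
               "organizations": ["Apache"]}

   def index(core, docs):
       # Solr's /update handler accepts a JSON array of documents
       requests.post(SOLR + "/" + core + "/update",
                     json=docs, params={"commit": "true"})

   def route(doc_id, text):
       # One incoming repository document fans out into several new
       # documents, each sent to its own output (Solr core):
       index("raw", [{"id": doc_id, "content": text}])
       for entity_type, values in extract_entities(text).items():
           index(entity_type,
                 [{"id": "%s-%s-%d" % (doc_id, entity_type, i), "entity": v}
                  for i, v in enumerate(values)])

Something along these lines inside the pipeline would avoid pushing the 
same document to all four outputs and filtering afterwards.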

Cheers,
Rafa