Posted to dev@manifoldcf.apache.org by Rafa Haro <rh...@apache.org> on 2014/06/30 12:48:47 UTC
Testing Pipelines. Conclusions so far and Some Doubts
Hi,
I have spent a couple of hours testing the Pipelines in ManifoldCF 1.7.
Before describing the problems I have run into and asking some
questions, I would like to explain the kind of tests I have performed so
far:
1. Using the File System repository connector, for simplicity
2. Using two instances of the Solr output connector to test multiple
outputs. Both point to the same Solr instance, but each output connector
has been configured with a different Solr core (collection1 and
collection2)
3. Using Allowed Documents and Tika Extractor as Transformation
connectors. Allowed Documents has been configured to allow only PDF
files (mimetype + extension)
4. The processing pipeline I wanted to configure is quite simple: filter
and extract content (with Tika) for collection1, and a plain crawl for
collection2. To be more precise: both transformation connectors were
configured for the collection1 Solr output, and no transformation
connector was configured for collection2. I have two files in the
repository path configured for the File System connector: a PDF file and
an ODS file. I was expecting only the PDF file to be indexed in
collection1, and both files in collection2.
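To make the expected routing concrete, here is a small Python sketch of the per-output behavior I had in mind (the helper names are purely illustrative, not ManifoldCF classes):

```python
# Hypothetical sketch of the expected per-output filtering: only PDFs
# should reach collection1 (which has the Allowed Documents filter),
# while collection2 (no transformation connectors) receives everything.

def allowed_for_collection1(filename, mimetype):
    """Mimic the Allowed Documents rule: PDF only (extension + mimetype)."""
    return filename.lower().endswith(".pdf") and mimetype == "application/pdf"

docs = [
    ("report.pdf", "application/pdf"),
    ("sheet.ods", "application/vnd.oasis.opendocument.spreadsheet"),
]

collection1 = [name for name, mime in docs if allowed_for_collection1(name, mime)]
collection2 = [name for name, mime in docs]  # no filter on this output

print(collection1)  # ['report.pdf']
print(collection2)  # ['report.pdf', 'sheet.ods']
```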
The result of the experiment has been the following:
1. All the files have been indexed in both collections. Apparently the
Allowed Documents transformation connector doesn't work with the File
System repository connector.
2. For the collection1 output connector, I first changed the update
handler from /update/extract to /update, because the Tika Extractor was
going to be configured for it. This change produced an error in Solr
while indexing
(Unsupported ContentType: application/octet-stream Not in:
[application/xml, text/csv, text/json, application/csv,
application/javabin, text/xml, application/json]).
3. Therefore, I set the update handler back to /update/extract. Because
exactly the same content is being indexed in both cores, I have no way
to tell whether the Tika transformation connector is working properly or
not.
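My understanding of that error (a sketch, not an official ManifoldCF contract): /update only accepts already-parsed formats, while /update/extract accepts raw binaries and runs Tika server-side, so a connector still posting bytes as application/octet-stream must target the latter. The accepted-type list below is copied from the Solr error message above:

```python
# Sketch of the handler mismatch: Solr's /update handler rejects any
# content type outside its parsed-format list, while /update/extract
# (the ExtractingRequestHandler) can take raw binary documents.

UPDATE_ACCEPTED = {
    "application/xml", "text/xml", "text/csv", "application/csv",
    "text/json", "application/json", "application/javabin",
}

def pick_handler(content_type):
    """Choose a Solr handler for a given posted content type."""
    if content_type in UPDATE_ACCEPTED:
        return "/update"
    # Anything else (e.g. application/octet-stream) needs server-side
    # extraction, or must be parsed before posting.
    return "/update/extract"

print(pick_handler("application/octet-stream"))  # /update/extract
print(pick_handler("application/json"))          # /update
```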
So much for the testing outcomes. Now I would like to share some
conclusions from the point of view of our use case. Although the
pipeline approach is great, as far as I have understood it we still
can't use it for our purposes. Specifically, what we would like is to
somehow create new repository documents at any point in the chain and
send them to different output connectors. Let me give a simple use case:
we want to process the documents to extract named entities: persons,
places and organizations. The first transformation of the pipeline could
use any NER system to extract the named entities. Then I want to have
separate repositories (outputs): one for the raw content and one for
each type of entity, let's say four different Solr cores. Of course,
with the current approach I could send the same repository document to
all the outputs and filter in each of them, but that doesn't sound like
a good solution to me.
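The fan-out we would like could be sketched like this (all names here are hypothetical, the NER result is hard-coded for illustration, and nothing is an actual ManifoldCF API):

```python
# Sketch of the desired pipeline fan-out: one NER pass produces derived
# documents that are routed to per-entity-type cores, while the raw
# content goes to its own core.

def extract_entities(text):
    """Stand-in for any NER system; returns entities grouped by type."""
    # Hard-coded result, for illustration only.
    return {
        "person": ["Alice"],
        "place": ["Paris"],
        "organization": ["ACME"],
    }

def route(doc_id, text):
    """Return (core, document) pairs: raw content plus one doc per entity."""
    targets = [("core_raw", {"id": doc_id, "content": text})]
    for etype, names in extract_entities(text).items():
        for name in names:
            targets.append((f"core_{etype}",
                            {"id": f"{doc_id}/{etype}/{name}", "name": name}))
    return targets

for core, doc in route("doc1", "some text"):
    print(core, doc["id"])
```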
Cheers,
Rafa