Posted to dev@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2014/06/11 16:09:51 UTC

Call for trunk pipeline testers

Hi folks,

ManifoldCF finally has a pipeline!  All tests pass.  Now I'm looking for
people to try things out by hand to see if there are any rough edges,
before we get too far along in the 1.7 development cycle to fix them.

Trunk has all the necessary moving parts and documentation as well.  There
are two transformation connectors available -- one that does nothing but
pass data through, and one that forces metadata (just like the framework
"Forced metadata" tab).  But since you can have more than one of each kind
of connector in a pipeline, this should be enough to exercise things fairly
completely.
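
As a rough illustration of the pipeline idea described above, the sketch below chains a pass-through stage and two forced-metadata stages. It is written in Python for brevity and is not ManifoldCF's actual Java connector API: the names `Document`, `null_transform`, `forced_metadata`, and `run_pipeline` are hypothetical, and the choice to have forced values overwrite existing fields is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document moving through the pipeline."""
    content: bytes
    metadata: Dict[str, List[str]] = field(default_factory=dict)


# A transformation stage takes a document and returns a (possibly new) document.
Transform = Callable[[Document], Document]


def null_transform(doc: Document) -> Document:
    """Pass the document through unchanged."""
    return doc


def forced_metadata(forced: Dict[str, List[str]]) -> Transform:
    """Build a stage that sets the given metadata fields (assumed to overwrite)."""
    def stage(doc: Document) -> Document:
        merged = dict(doc.metadata)
        merged.update(forced)
        return Document(doc.content, merged)
    return stage


def run_pipeline(doc: Document, stages: List[Transform]) -> Document:
    """Apply each stage in order; the result is what an output stage would receive."""
    for stage in stages:
        doc = stage(doc)
    return doc


# The same kind of stage can appear more than once in a pipeline.
pipeline = [
    null_transform,
    forced_metadata({"source": ["wiki"]}),
    forced_metadata({"lang": ["en"]}),
]
result = run_pipeline(Document(b"hello", {"title": ["Test"]}), pipeline)
```

Repeating the same kind of stage, as in the last lines, is the sort of configuration that manual testing could exercise.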

We still need to address a couple of things in the medium and long term.
First, we need a Tika transformation connector that extracts metadata from
binary files.  There's an existing ticket for that: CONNECTORS-954.  If
anyone wants to take a crack at that, please let me know.  (Takumi Yoshida
would be the obvious choice.)  Second, we need to come up with a strategy
for removing obsolete tabs/features, like the aforementioned general job
Forced Metadata tab.  We've got a fair number of such features around now.
These kinds of things cannot be removed without either a comprehensive
automatic upgrade, or loss of backwards compatibility.  I am thinking maybe
we break with backwards compatibility and work towards cleaning out
duplicate features for ManifoldCF 2.0.

Thoughts?

Karl

Re: Call for trunk pipeline testers

Posted by Karl Wright <da...@gmail.com>.
I've created CONNECTORS-962 to track the "multiple output" idea.

Karl


On Wed, Jun 11, 2014 at 12:21 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Rafa,
>
> We would be very interested in a contribution that addresses
> CONNECTORS-954.  As for changing the Solr connector so that it does not use
> the extracting update handler: as long as that is only one of many options,
> that contribution would be welcome too.  Please consider opening a ticket
> specifically for that change.
>
> Output to multiple indexes at the same time has come up before, but this
> is more of a challenge because in theory we'd want to keep a different
> record in the ingeststatus table for each document for each individual
> output index.  With pipeline support, each output index would also no doubt
> need a distinct pipeline as well.  Nevertheless, I'm not opposed to adding
> this feature if I can work out a good way to do it.
>
> So let's start with CONNECTORS-954 and Solr connector changes, and see how
> far we get.
>
> Karl

Re: Call for trunk pipeline testers

Posted by Karl Wright <da...@gmail.com>.
Hi Rafa,

We would be very interested in a contribution that addresses
CONNECTORS-954.  As for changing the Solr connector so that it does not use
the extracting update handler: as long as that is only one of many options,
that contribution would be welcome too.  Please consider opening a ticket
specifically for that change.

Output to multiple indexes at the same time has come up before, but this is
more of a challenge because in theory we'd want to keep a different record
in the ingeststatus table for each document for each individual output
index.  With pipeline support, each output index would also no doubt need a
distinct pipeline as well.  Nevertheless, I'm not opposed to adding this
feature if I can work out a good way to do it.
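
To make the bookkeeping concern concrete, here is a minimal sketch of status records keyed per document and per output index. It is an in-memory Python illustration only and does not reflect the real ingeststatus schema; `IngestStatus`, `record`, `needs_indexing`, the version strings, and the output names are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical key: (document identifier, output index name).
IngestKey = Tuple[str, str]


class IngestStatus:
    """In-memory stand-in for a status table with one row per document per output."""

    def __init__(self) -> None:
        self._versions: Dict[IngestKey, str] = {}

    def record(self, doc_id: str, output: str, version: str) -> None:
        """Remember which version of the document was sent to this output."""
        self._versions[(doc_id, output)] = version

    def needs_indexing(self, doc_id: str, output: str, version: str) -> bool:
        """True when this output has not yet seen this version of the document."""
        return self._versions.get((doc_id, output)) != version


status = IngestStatus()
status.record("doc-1", "solr-main", "v1")
```

With this keying, `doc-1` counts as up to date for `solr-main` but still pending for any other output, which is why each output would need its own record.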

So let's start with CONNECTORS-954 and Solr connector changes, and see how
far we get.

Karl



On Wed, Jun 11, 2014 at 12:10 PM, Rafa Haro <rh...@apache.org> wrote:

> Hi Karl,
>
> We (at Zaizi) also had this requirement. We initially addressed it by
> creating a sort of "Processor Connector", mainly for semantically enhancing
> the repository documents before indexing them. We would be very happy to
> give this a try and provide feedback, because our current approach is
> entirely temporary. Apart from processing the document, we also had a
> special requirement: producing different instances of repository
> documents, because we populate more than one index at the same time. We
> would also need to check how we can do exactly the same with this
> processing pipeline.
>
> Apart from this, Karl, we can also take care of the Tika integration
> (actually, we have already done it) and eventually take care of
> CONNECTORS-954. Because we already use Tika as a "processor connector", we
> are also going to modify the Solr connector so that it does not use the
> extract update handler, which also presents some problems. Would that be
> of interest to the community as well?
>
> Cheers,
> Rafa

Re: Call for trunk pipeline testers

Posted by Karl Wright <da...@gmail.com>.
Hi Ahmet,

For 1.x we can't drop backwards compatibility, so we can't drop our custom
Solr server implementation.  For 2.0 we might be able to drop things that
are no longer useful, though.

Karl



On Wed, Jun 11, 2014 at 6:42 PM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi,
>
> bq. we are going to also modify the solr connector for not using the
> extract update handler
>
> +1 to this. With this, we can support a wide range of Solr versions by just
> sending XML update messages. Solr setups will be simpler, i.e. there is no
> need to have the Solr Cell jars. We can drop our custom Solr server
> implementation.
>
> Ahmet

Re: Call for trunk pipeline testers

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

bq. we are going to also modify the solr connector for not using the extract update handler 

+1 to this. With this, we can support a wide range of Solr versions by just sending XML update messages. Solr setups will be simpler, i.e. there is no need to have the Solr Cell jars. We can drop our custom Solr server implementation.
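
For readers unfamiliar with the format, a Solr XML update message is an `<add>` element wrapping one or more `<doc>` elements made of `<field>` entries. The sketch below builds one with Python's standard library; it is only an illustration, not the connector's code. The helper name `build_add_message` is hypothetical, and using `id` as the unique key field is an assumption about the target schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_add_message(doc_id: str, fields: Dict[str, List[str]]) -> str:
    """Serialize one document as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" is assumed to be the schema's unique key field.
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, values in fields.items():
        # Multi-valued fields are expressed as repeated <field> elements.
        for value in values:
            ET.SubElement(doc, "field", name=name).text = value
    # ElementTree escapes special characters such as "&" in text content.
    return ET.tostring(add, encoding="unicode")


message = build_add_message("doc-1", {"title": ["Hello & welcome"]})
```

A message like this would be sent to Solr's update handler as plain XML, with text already extracted upstream, so the extracting handler and its Solr Cell jars would not be involved.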

Ahmet




Re: Call for trunk pipeline testers

Posted by Rafa Haro <rh...@apache.org>.
Hi Karl,

We (at Zaizi) also had this requirement. We initially addressed it by
creating a sort of "Processor Connector", mainly for semantically
enhancing the repository documents before indexing them. We would be
very happy to give this a try and provide feedback, because our current
approach is entirely temporary. Apart from processing the document, we
also had a special requirement: producing different instances of
repository documents, because we populate more than one index at the
same time. We would also need to check how we can do exactly the same
with this processing pipeline.

Apart from this, Karl, we can also take care of the Tika integration
(actually, we have already done it) and eventually take care of
CONNECTORS-954. Because we already use Tika as a "processor connector",
we are also going to modify the Solr connector so that it does not use
the extract update handler, which also presents some problems. Would
that be of interest to the community as well?

Cheers,
Rafa
