Posted to dev@manifoldcf.apache.org by Gustavo Beneitez <gu...@gmail.com> on 2018/07/25 21:57:26 UTC

Create a new ACTIVITY_FETCH from a transformation

Hi all,

I need to extract and analyse crawled URLs because they may contain certain
parameters, such as "?redirectURL=", that could point to new documents to be
fetched and indexed.
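
As a minimal sketch of the extraction step alone, assuming the parameter is named redirectURL as in the example above and that its value is percent-encoded, something like the following could pull the target out of a crawled URL. The class and method names are made up for illustration and are not part of ManifoldCF; the URLDecoder overload used here needs Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not a ManifoldCF class.
final class RedirectParamSketch {

    // Returns the decoded value of the named query parameter, if the URL has one.
    // URI.create throws IllegalArgumentException for a malformed URL.
    static Optional<String> extractParam(String url, String name) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // parameter without a value
            }
            if (pair.substring(0, eq).equals(name)) {
                String raw = pair.substring(eq + 1);
                return Optional.of(URLDecoder.decode(raw, StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Example URL invented for this sketch.
        String crawled =
            "http://example.com/login?redirectURL=http%3A%2F%2Fexample.com%2Fdocs%2Fa.html&lang=en";
        System.out.println(extractParam(crawled, "redirectURL").orElse("(none)"));
    }
}
```

This only finds the second URL; it does not by itself get that URL fetched.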

First I tried creating a subclass:

public class RedirectExtractor extends
org.apache.manifoldcf.agents.transformation.BaseTransformationConnector

and adding a "RedirectExtractor" transformation step to the fetch process in
ManifoldCF, but that only allows me to modify the current document, not to
create a new FETCH from the extracted parameter.

Looking through the ManifoldCF source code, I found something that may
come in handy:

activities.recordActivity(null,ACTIVITY_FETCH,
                null,urlValue,Integer.toString(-2),"Robots exclusion",null);

from the IProcessActivity interface, which is used by the connectors. I
would rather not create a new connector, since that is a bit complex. Do you
see an alternative, or is this the only way?

Thanks in advance.

Re: Create a new ACTIVITY_FETCH from a transformation

Posted by Gustavo Beneitez <gu...@gmail.com>.

Thanks. I suspected as much while reviewing the code, but I was hoping
there was an alternative :)

Regards.

On Thu, Jul 26, 2018 at 12:11, Karl Wright (<da...@gmail.com>)
wrote:

> ManifoldCF has the concept of "compound document", but all the independent
> "components" of the document must be identified at the root level (that is,
> in the Repository Connector).
>
> I'm therefore afraid there is no good mapping from ManifoldCF concepts to
> what you want to do without writing your own Repository Connector.
>
> Karl

Re: Create a new ACTIVITY_FETCH from a transformation

Posted by Karl Wright <da...@gmail.com>.

ManifoldCF has the concept of "compound document", but all the independent
"components" of the document must be identified at the root level (that is,
in the Repository Connector).

I'm therefore afraid there is no good mapping from ManifoldCF concepts to
what you want to do without writing your own Repository Connector.
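
To make that concrete, here is a rough sketch of how a repository connector could queue the extra URL while it processes a document. The ReferenceSink interface is a stand-in invented for this example so the code runs on its own. It is modeled on the assumption that IProcessActivity gives repository connectors an addDocumentReference call for this purpose (recordActivity, by contrast, appears only to write an entry to the job history). The class name, the method name and the example URL are hypothetical as well.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the connector's activities object (assumption: ManifoldCF's
// IProcessActivity plays this role through addDocumentReference).
interface ReferenceSink {
    void addDocumentReference(String documentIdentifier);
}

// Hypothetical sketch of the discovery step inside a repository connector.
final class RedirectDiscoverySketch {
    static final String PARAM = "redirectURL=";

    // Called once per crawled document; queues the URL carried in the
    // redirectURL parameter, if any, as a new document for the crawler.
    static void processDocument(String documentIdentifier, ReferenceSink sink) {
        String query = URI.create(documentIdentifier).getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.startsWith(PARAM)) {
                    String target = URLDecoder.decode(
                        pair.substring(PARAM.length()), StandardCharsets.UTF_8);
                    sink.addDocumentReference(target);
                }
            }
        }
        // The connector would then fetch and ingest documentIdentifier as usual.
    }

    public static void main(String[] args) {
        List<String> queued = new ArrayList<>();
        processDocument(
            "http://example.com/login?redirectURL=http%3A%2F%2Fexample.com%2Fb.html",
            queued::add);
        System.out.println(queued);
    }
}
```

The point of the design is that discovery happens where the framework expects it, in the repository connector, so the new URL goes through the same fetch, transform and ingest path as any other document.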

Karl


On Thu, Jul 26, 2018 at 5:06 AM Gustavo Beneitez <gu...@gmail.com>
wrote:

> Hi Karl,
>
> I made a quick picture of what I really need (attached).
>
> Certain URLs coming from the repository can be split into two: URL1 and
> URL2.
>
> The normal flow acts as if only one URL is present, but by writing a new
> transformation I can detect that there is a second one: URL2.
> My question now is: "I have URL2; how can I inject it into the flow so
> that it becomes a new URL from the repository (and is then fetched,
> processed and ingested like the others)?"
>
> Thanks.

Re: Create a new ACTIVITY_FETCH from a transformation

Posted by Gustavo Beneitez <gu...@gmail.com>.

Hi Karl,

I made a quick picture of what I really need (attached).

Certain URLs coming from the repository can be split into two: URL1 and
URL2.

The normal flow acts as if only one URL is present, but by writing a new
transformation I can detect that there is a second one: URL2.
My question now is: "I have URL2; how can I inject it into the flow so
that it becomes a new URL from the repository (and is then fetched,
processed and ingested like the others)?"

Thanks.



On Thu, Jul 26, 2018 at 0:35, Karl Wright (<da...@gmail.com>)
wrote:

> The crawled URL is transmitted as part of the RepositoryDocument object to
> the output connector.  If this is going to Solr, it's used as the
> document's ID.  You can therefore customize Solr (or ElasticSearch) to
> extract the data you need at the indexing end.
>
> If this doesn't make any sense to you, then please be more specific about
> what the disposition of each crawled document is.
>
> Thanks,
> Karl

Re: Create a new ACTIVITY_FETCH from a transformation

Posted by Karl Wright <da...@gmail.com>.

The crawled URL is transmitted as part of the RepositoryDocument object to
the output connector.  If this is going to Solr, it's used as the
document's ID.  You can therefore customize Solr (or ElasticSearch) to
extract the data you need at the indexing end.

If this doesn't make any sense to you, then please be more specific about
what the disposition of each crawled document is.

Thanks,
Karl

On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <gu...@gmail.com>
wrote:

> Hi all,
>
> I need to extract and analyse crawled URLs because they may contain certain
> parameters, such as "?redirectURL=", that could point to new documents to be
> fetched and indexed.
>
> First I tried creating a subclass:
>
> public class RedirectExtractor extends
> org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
>
> and adding a "RedirectExtractor" transformation step to the fetch process in
> ManifoldCF, but that only allows me to modify the current document, not to
> create a new FETCH from the extracted parameter.
>
> Looking through the ManifoldCF source code, I found something that may
> come in handy:
>
> activities.recordActivity(null,ACTIVITY_FETCH,
>                 null,urlValue,Integer.toString(-2),"Robots exclusion",null);
>
> from the IProcessActivity interface, which is used by the connectors. I
> would rather not create a new connector, since that is a bit complex. Do you
> see an alternative, or is this the only way?
>
> Thanks in advance.
>