Posted to dev@manifoldcf.apache.org by Antonio David Pérez Morales <ad...@gmail.com> on 2015/05/26 17:10:14 UTC

[GSoC] Confluence connector project status after bonding period

Hi all

During the bonding period and over the last few days I have been
familiarizing myself with the Confluence API and running some tests with
curl before starting the implementation of the repository connector, which
is the first step stated in the proposal.

I have also deployed a local instance of Confluence, so that I can do
the development and tests against that instance.

As stated in the proposal, Confluence is migrating its old APIs (rpc-xml,
rpc-json) to the new REST API, so not all the methods have been migrated yet.
Getting the changes, fields and content of the documents won't be a
problem, but for permissions I have to check in more depth whether the new
REST API already supports them.
If not, we will have to mix in the methods provided by the rpc-json
API for that, and update the connector once the REST API contains all the
methods.

After the first tests, there is no easy way to retrieve the user
permissions because they are tied to documents and/or spaces. So, in order
to retrieve the user permissions, a document ID or space ID and a user have
to be provided. I proposed two approaches to tackle this: one, not very
efficient, that makes many requests to Confluence, and another that
develops a Confluence plugin to get them (at least at the Java API level it
is possible, but I don't know yet what kind of permissions it returns).

So I think that, for that part, we can start by trying permissions at space
level and then try to go finer, to document level.
These problems mainly concern the second part of the project
(the Authority Connector), but I think it is useful to share here some
results from the first overall tests I have performed.

Regards

Re: [GSoC] Confluence connector project status after bonding period

Posted by Rafa Haro <rh...@apache.org>.
Hi Antonio, devs

Thanks for the update. Glad to see this rolling after the bonding period. The next step could probably be to set up a GitHub repo. We can also use a wiki to follow the development.

As planned in the proposal, let's start with the repository connector. It would be nice to have a first overview of the configuration that might be needed (authentication, crawling options and so on).

Happy coding!

:-)

On Wed, May 27, 2015 at 10:15 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Antonio,
> I agree that it's pretty important to understand pretty much what will be
> needed before actually beginning coding.  Thanks!
> Karl

Re: [GSoC] Confluence connector project status after bonding period

Posted by Karl Wright <da...@gmail.com>.
Hi Antonio,

I agree that it's pretty important to understand pretty much what will be
needed before actually beginning coding.  Thanks!

Karl



Re: [GSoC] Confluence connector project status after bonding period

Posted by Rafa Haro <rh...@apache.org>.
Hi Antonio,

First of all, it is nice to see such good progress here. I will fork the
repo on GitHub and give it a try very soon because I'm starting to
need it for a real use case, so I will be able to provide user feedback
very soon. Let me comment on your email:

On Fri, Jun 5, 2015 at 6:48 PM, Antonio David Pérez Morales <
adperezmorales@gmail.com> wrote:

> Hi devs and all
>
> As part of the development of the Atlassian Confluence connector for
> ManifoldCF, I have created a repository [1] on my GitHub account.
> I have also developed and pushed the first version of the Confluence
> repository connector on a branch called 'feature/repository-connector'.
> This first version of the repository connector crawls all the
> Confluence pages contained in the spaces (only pages), with the option
> to filter by a space.
> At the moment, either a single space or all of them can be configured to be
> crawled, but I will keep improving the connector by adding more features,
> such as configuring more than one space or others you may find useful.
>

Do we know anything about page metadata? How is the content retrieved: in
a raw format, or is it also possible to get formatted content?


> The ACL of each crawled document is the space it belongs to, which
> is aligned with the proposal for the Authority connector: start at space
> level and then go finer-grained if necessary.
> There are no tests included at the moment; I'm working on them.
>

Ok, perfect. Probably it is too soon for having tests, but it would be
great to include them progressively while you are developing the connector.


>
> To check that the proposed Authority connector is feasible,
> I have done a quick proof of concept of its logic and I was
> able to get the spaces that a user has permission to access. So I can
> confirm that, at space level, we can add permissions to the crawled
> documents in order to filter them later in the search engine being used.
>
> One important thing is that the new Confluence REST API does not include
> any endpoints for permissions yet, so the legacy APIs have to be used for
> that until the Confluence developers finish porting the legacy methods to
> the new API.
>

Let's hope they include them in time for the second part of the project; if
not, we can always use the legacy API and update it after GSoC as a normal
contribution.


>
> I will continue improving the repository connector and thinking of new
> features to add, but if you think some feature would be interesting or good
> to have, please let me know and I will look into including it.
>
> As stated in the proposal, the Authority Connector is planned for the
> second part of the project, but I have started working on it a bit to make
> sure we can have a general working version and then improve it iteratively.
>

Ok, let's focus on the repository connector then until the midterm evaluation.
The next step would probably be to include more filtering options and check
what kind of metadata can be extracted from the pages. I will check the
code, but the kind of things we should be thinking about now are, for
example, the page URLs to be indexed, the seeding strategy and so on.


>
> As always, comments and suggestions are more than welcome.
>

Nothing else from my side. Well done so far!


>
> Regards
>
>
> [1] https://github.com/adperezmorales/confluence-manifold-connector/
>

Re: [GSoC] Confluence connector project status after bonding period

Posted by Rafa Haro <rh...@apache.org>.
Hi Antonio,

Thanks a lot for the update and congratulations on reaching GSoC's midterm
successfully. I have been checking the proposal again carefully and, as I
remarked in your evaluation, you are on schedule so far. Let me
comment on your previous email:

On Sat, Jun 27, 2015 at 11:30 AM, Antonio David Pérez Morales <
adperezmorales@gmail.com> wrote:

> Hi all
>
> Continuing with the development of the Confluence repository connector [1],
> I have added support for processing page attachments (configurable per
> job), the ability to crawl each kind of page, and extraction of the page
> labels if they have been set.
>

Good. During this week, I will pull the latest version from GitHub and will
be testing it against a real Confluence instance at my current company. I
will also take a look at the code to check whether any refactoring is needed.
After testing, I should have a better idea about possible changes regarding
configuration, performance, integration details and so on. I will provide
my feedback ASAP both to you directly and here on the ManifoldCF
developers list. That way, anyone else in the community interested in the
connector can also provide feedback, ideas... I won't delay
the testing too much, because this is the right moment to make any
changes/additions before continuing with the second part of the project
which, according to the proposal planning, should be focused on the
Authority connector, unit testing and final documentation. In summary,
let's close the repository connector first :-).


> In addition, the code and the client have been completely refactored, so
> now the connector has a better code flow and the specific components have
> been simplified.
>

Perfect, I will take a look at the code and provide feedback as well.


> Right now, the connector is able to process pages and attachments, where
> a Page is any supported Confluence page type (blog, table, etc.). As an
> improvement, I want to add specific support for each kind of page, giving
> the connector the ability to process page-specific features. For example,
> for Blog pages the connector currently extracts only the default/common
> properties, but adding specific support for the Blog page model (as has
> been done for attachments) would make it possible to extract Blog-specific
> properties and set the type of that page to Blog instead of the default, Page.
>

Well, I suppose we should decide on a scope here, because taking into
account all types of Confluence pages would probably complicate
the connector too much right now. This is something that we can progressively
improve or contribute after GSoC has finished. What is important now
is to come up with a good design for Confluence pages in the connector, in
order to ease the inclusion of concrete behavior for any kind of page. Let
me take a look at the current code and also check how many different kinds
of pages Confluence currently supports, and I will come back to you with
a suggestion.


>
> Another thing being done in parallel is the development of the
> tests, so as to have a complete set of unit tests for the midterm.


> The code can be found at [1]. It is a git branch. After the midterm, I will
> merge that branch into master and create another one for the
> development of the authority connector, to keep them separate for the
> moment.
>

Ok. I encourage the rest of the community to also take a look at the code and
provide feedback. I will try to do the same with the other ManifoldCF
GSoC project, for integrating Kafka. If it makes things easier, once you merge
the connector into the master branch I could set up a branch in ManifoldCF's
svn space to ease the testing.

Cheers,
Rafa


>
> If you have any question/suggestion/comment, please drop a message here
> which will be more than welcome.
>
> Regards
> --
>
> [1]
>
> https://github.com/adperezmorales/confluence-manifold-connector/tree/feature/repository-connector
>

Re: [GSoC] Confluence connector project status after bonding period

Posted by Antonio David Pérez Morales <ad...@gmail.com>.
Hi all

Continuing with the development of the Confluence repository connector [1], I
have added support for processing page attachments (configurable per job),
the ability to crawl each kind of page, and extraction of the page labels if
they have been set.
In addition, the code and the client have been completely refactored, so now
the connector has a better code flow and the specific components have been
simplified.
Right now, the connector is able to process pages and attachments, where
a Page is any supported Confluence page type (blog, table, etc.). As an
improvement, I want to add specific support for each kind of page, giving
the connector the ability to process page-specific features. For example,
for Blog pages the connector currently extracts only the default/common
properties, but adding specific support for the Blog page model (as has been
done for attachments) would make it possible to extract Blog-specific
properties and set the type of that page to Blog instead of the default, Page.
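
The per-type support described above could be sketched roughly as follows. This is only an illustration in Python, not the connector's code: the class names, the `doc_type` field and the `MODELS` registry are all hypothetical, and it assumes that the REST API reports blog posts with the content type `blogpost` and ordinary pages with `page`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Common properties shared by every supported page type."""
    id: str
    title: str
    labels: List[str] = field(default_factory=list)
    doc_type: str = "Page"  # default type reported for the document


@dataclass
class BlogPost(Page):
    """Blog-specific model; extra blog-only properties would be added here."""
    doc_type: str = "Blog"


# Hypothetical registry: content type string -> model class.
MODELS = {"page": Page, "blogpost": BlogPost}


def build_document(raw: Dict) -> Page:
    """Pick the most specific model for a raw item, falling back to Page."""
    model = MODELS.get(raw.get("type"), Page)
    return model(id=raw["id"], title=raw["title"], labels=raw.get("labels", []))
```

Adding support for another kind of page would then mean adding one subclass and one registry entry, while unknown types keep being processed as a plain Page.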

Another thing being done in parallel is the development of the
tests, so as to have a complete set of unit tests for the midterm.

The code can be found at [1]. It is a git branch. After the midterm, I will
merge that branch into master and create another one for the
development of the authority connector, to keep them separate for the moment.

If you have any questions, suggestions or comments, please drop a message
here; they are more than welcome.

Regards
--

[1]
https://github.com/adperezmorales/confluence-manifold-connector/tree/feature/repository-connector


Re: [GSoC] Confluence connector project status after bonding period

Posted by Antonio David Pérez Morales <ad...@gmail.com>.
Hi devs and all

As part of the development of the Atlassian Confluence connector for
ManifoldCF, I have created a repository [1] on my GitHub account.
I have also developed and pushed the first version of the Confluence
repository connector on a branch called 'feature/repository-connector'.
This first version of the repository connector crawls all the
Confluence pages contained in the spaces (only pages), with the option
to filter by a space.
At the moment, either a single space or all of them can be configured to be
crawled, but I will keep improving the connector by adding more features,
such as configuring more than one space or others you may find useful.
The ACL of each crawled document is the space it belongs to, which
is aligned with the proposal for the Authority connector: start at space
level and then go finer-grained if necessary.
There are no tests included at the moment; I'm working on them.
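
As a rough illustration of the one-space-or-all filter, the listing request could be built as in the sketch below. It is written in Python for brevity and is not the connector's code; the function name is hypothetical, and it assumes the content listing resource of the Confluence REST API (`/rest/api/content` with the `type`, `spaceKey`, `start` and `limit` query parameters).

```python
from typing import Optional
from urllib.parse import urlencode


def content_url(base_url: str, space_key: Optional[str] = None,
                start: int = 0, limit: int = 25) -> str:
    """Build a paged URL that lists pages, optionally limited to one space."""
    # Only pages are requested, matching the first version of the connector.
    params = {"type": "page", "start": start, "limit": limit}
    if space_key:
        # No space key means "crawl all spaces".
        params["spaceKey"] = space_key
    return base_url.rstrip("/") + "/rest/api/content?" + urlencode(params)


print(content_url("http://localhost:8090", space_key="DEV"))
```

Supporting several spaces later would only require calling the same function once per configured space key.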

To check that the proposed Authority connector is feasible,
I have done a quick proof of concept of its logic and I was
able to get the spaces that a user has permission to access. So I can
confirm that, at space level, we can add permissions to the crawled
documents in order to filter them later in the search engine being used.
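
The space-level idea can be sketched as follows, assuming each crawled document carries its space key as its only access token and the authority side returns the set of space keys a user may view. The function names and the document layout are hypothetical, and no Confluence call is made here.

```python
from typing import Dict, Iterable, List, Set


def space_acl(space_key: str) -> List[str]:
    """Access tokens attached to a crawled page: just its space key."""
    return [space_key]


def visible_documents(docs: Iterable[Dict], user_spaces: Set[str]) -> List[Dict]:
    """Keep documents whose ACL shares at least one token with the user's spaces."""
    return [d for d in docs if user_spaces.intersection(d["acl"])]


docs = [
    {"id": "101", "acl": space_acl("DEV")},
    {"id": "102", "acl": space_acl("HR")},
]
# Suppose the authority side reported that the user can view only DEV.
print([d["id"] for d in visible_documents(docs, {"DEV"})])  # prints ['101']
```

Going finer later, to document level, would mean adding more tokens to each document's ACL without changing the filtering step.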

One important thing is that the new Confluence REST API does not include
any endpoints for permissions yet, so the legacy APIs have to be used for that
until the Confluence developers finish porting the legacy methods to the new
API.

I will continue improving the repository connector and thinking of new
features to add, but if you think some feature would be interesting or good
to have, please let me know and I will look into including it.

As stated in the proposal, the Authority Connector is planned for the
second part of the project, but I have started working on it a bit to make
sure we can have a general working version and then improve it iteratively.

As always, comments and suggestions are more than welcome.

Regards


[1] https://github.com/adperezmorales/confluence-manifold-connector/
