Posted to dev@tuscany.apache.org by Phillipe Ramalho <ph...@gmail.com> on 2009/04/01 09:02:02 UTC

Re: [GSoC 2009] Add search capability to index/search artifacts in the SCA domain

Thanks Luciano,

You might start thinking about how you are going to integrate with the
runtime, possibly into the contribution processing as a new phase or a new
type of processor?

OK, I will investigate that further and add some details about it to my
proposal. I will let everyone know when I update it.
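Purely to illustrate those two options, and not the actual Tuscany
contribution-processing SPI, the indexing hook might look roughly like this
(every name below is hypothetical):

    import java.net.URL;

    // Called for each artifact while a contribution is processed, either
    // from a dedicated indexing phase or from a wrapping processor.
    interface ArtifactIndexer {
        void index(String contributionUri, URL artifactUrl) throws Exception;
    }

    // Option 2: a decorating processor that indexes each artifact as a side
    // effect of normal processing (option 1 would instead walk the finished
    // contribution model in a separate indexing phase).
    class IndexingArtifactProcessor {
        private final ArtifactIndexer indexer;

        IndexingArtifactProcessor(ArtifactIndexer indexer) {
            this.indexer = indexer;
        }

        void process(String contributionUri, URL artifactUrl) throws Exception {
            // ... delegate to the real artifact processor here ...
            indexer.index(contributionUri, artifactUrl);
        }
    }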

Best Regards,
Phillipe Ramalho

On Tue, Mar 31, 2009 at 10:29 AM, Luciano Resende <lu...@gmail.com>wrote:

> On Tue, Mar 31, 2009 at 1:04 AM, Phillipe Ramalho
> <ph...@gmail.com> wrote:
> > Hi everyone,
> >
> > This is my proposal for the project "Add search capability to
> > index/search artifacts in the SCA domain" described at [1]. I have already
> > submitted the proposal on the GSoC web page and added it to the Tuscany
> > wiki proposals at [2].
> >
> > Any criticism, suggestions, comments, or reviews will be appreciated.
> >
> > I think there are some points in the proposal that could still be improved,
> > and I'm still working on them, mainly the points I say should be discussed
> > with the community, so any comments about those will also be
> > appreciated : )
> >
>
>
> Looks really good, and very detailed....
>
> You might start thinking about how you are going to integrate with the
> runtime, possibly into the contribution processing as a new phase or a new
> type of processor?
>
> Anyway, +1 from me.
>
> > Thanks in advance,
> > Phillipe Ramalho
> >
>
>
>
> --
> Luciano Resende
> Apache Tuscany, Apache PhotArk
> http://people.apache.org/~lresende
> http://lresende.blogspot.com/
>



-- 
Phillipe Ramalho

Re: [GSoC 2009] Add search capability to index/search artifacts in the SCA domain

Posted by Phillipe Ramalho <ph...@gmail.com>.
Hi everyone,

I just updated my proposal with the new features suggested by Luciano and
Adriano.

Any more comments will be appreciated ; )

Best Regards,
Phillipe Ramalho




-- 
Phillipe Ramalho

Re: [GSoC 2009] Add search capability to index/search artifacts in the SCA domain

Posted by Phillipe Ramalho <ph...@gmail.com>.
Hi Adriano,

Thanks for the comments, they are really helpful : )

Some comments in line:

In addition, for every artifact the indexed artifact is related to, extra
information can be added using a Lucene feature called payloads; this
information could describe the relationship between the elements.

I like this relationship idea. Have you thought about extending the Lucene
query parser so new syntax could be provided? We could extend it and add
support for something like isreferenced("StoreCatalog"), so every component
that is referenced by StoreCatalog would be returned. Well, maybe we could
also do this using a Lucene field; it would be much faster. Anyway, there are
cool features that could be built using payloads, we just need to come up
with some good ideas : )

I have never extended the Lucene query parser syntax myself, but I like the
idea too. I will investigate it some more and add it to the proposal ; )
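For illustration, a rough sketch of the field-based variant, assuming the
Lucene 2.x-style API that was current at the time (the field names "name" and
"references" and the class name are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class RelationshipIndexSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            // Lucene 2.x-style constructor; newer versions differ.
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);

            // One document per indexed artifact; a "references" field records
            // which components the artifact points to.
            Document doc = new Document();
            doc.add(new Field("name", "StoreComposite",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("references", "StoreCatalog",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // isreferenced("StoreCatalog") could be translated by an extended
            // query parser into a plain term query on the relationship field.
            IndexSearcher searcher = new IndexSearcher(dir);
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("references", "StoreCatalog")), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
            searcher.close();
        }
    }

An extended query parser could then rewrite isreferenced(...) into exactly
that term query, so the syntax extension and the field approach would end up
hitting the same index.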

To handle different file types, file analyzers will be implemented to
extract the text from each of them. For example, a .class file is a binary
file, but the method names (mainly the ones annotated with SCA annotations)
could be extracted using the Java Reflection API. File analyzers could also
call other analyzers recursively; for example, a .composite file could be
analyzed using a CompositeAnalyzer, and when it reaches the
implementation.java node it could invoke a JavaClassAnalyzer, and so on. This
way each type of file will have only its significant text indexed; otherwise,
if the file is parsed with a plain text file analyzer, every search for
"component" would find every composite file, because it contains a
"<component>" node declaration.

This is really what I had in mind: something that extracts only the relevant
information, because search is also about good results; it is not as simple
as just finding matches, otherwise Google would not be so famous and you
probably would never be applying for GSoC : )... I think we should also
implement an analyzer for compressed files; there are many jars in a domain,
and we cannot just ignore them.

Good idea; that way we could browse compressed files like browsing a folder.
I will also add it to the proposal.
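As a very rough sketch of the analyzer idea (all class and method names below
are made up; the real analyzers would plug into the indexing pipeline and
feed their terms to Lucene):

    import java.lang.reflect.Method;
    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    // A CompositeAnalyzer for .composite files would delegate to
    // JavaClassTextExtractor when it reaches an implementation.java node.
    class JavaClassTextExtractor {
        // .class files: index the class and method names obtained via
        // reflection rather than the raw bytes.
        List<String> extract(Class<?> clazz) {
            List<String> terms = new ArrayList<String>();
            terms.add(clazz.getName());
            for (Method method : clazz.getDeclaredMethods()) {
                terms.add(method.getName()); // SCA-annotated ones could be boosted
            }
            return terms;
        }
    }

    // Compressed files: list the entries so a jar can be browsed like a
    // folder; each entry would be handed to the extractor registered for
    // its file type.
    class JarTextExtractor {
        List<String> extract(JarFile jar) {
            List<String> terms = new ArrayList<String>();
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                terms.add(entries.nextElement().getName());
            }
            return terms;
        }
    }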

Now, about the "searching" session of your proposal, it's fine, I think
Lucene already give us a good query parser for user input. It's a good idea
to implement everything as an SCA component, and one of the services it
could provide is to search not only using a query text, but also accepting
Lucene query objects as input. Some app using the search component could
have a very user friendly interface where the user could check many
checkboxes and other high level GUI component to refine a query, for this
cases, when the app execute the search it would probably generate the Lucene
objects directly instead of creating a query string.

OK, I think that is going to be easy; the query text is converted to Lucene
query objects anyway, so the only thing this new functionality needs to do is
skip the parsing and execute the query objects directly against the
index : )
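A minimal sketch of that service contract, assuming a Lucene 2.x-style
QueryParser; the interface and class names are hypothetical, and the SCA
wiring and a proper serializable result type are left out:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    // One operation takes raw user input, the other takes a Lucene Query
    // built directly by the caller (for example from GUI controls).
    interface DomainSearchService {
        TopDocs search(String userQuery) throws Exception;
        TopDocs search(Query query) throws Exception;
    }

    class LuceneDomainSearchService implements DomainSearchService {
        private final IndexSearcher searcher;

        LuceneDomainSearchService(IndexSearcher searcher) {
            this.searcher = searcher;
        }

        // Text queries are parsed and then handled exactly like Query objects.
        public TopDocs search(String userQuery) throws Exception {
            QueryParser parser = new QueryParser("content", new StandardAnalyzer());
            return search(parser.parse(userQuery));
        }

        // GUI-built queries skip the parsing step entirely.
        public TopDocs search(Query query) throws Exception {
            return searcher.search(query, 10);
        }
    }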

Hey, this is a good way to display results, because in the results you can
already see the artifacts' relationships. Maybe we could work on expanding
the result tree down to files inside compressed files or methods inside class
files. I think this display model could be extended not only for displaying
results but also for displaying every artifact in the domain manager web app.

That's the idea, to expand down to every artifact we could parse and
index : )
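A tiny sketch of what a node in that result tree might look like (the type
and field names are hypothetical):

    import java.util.ArrayList;
    import java.util.List;

    // Each hit keeps its children (node -> contribution -> component ->
    // composite file -> matched text fragment), so the UI can expand down
    // to whatever level was parsed and indexed.
    class SearchResultNode {
        final String artifactType; // "node", "contribution", "component", ...
        final String label;        // display text or matched fragment
        final List<SearchResultNode> children = new ArrayList<SearchResultNode>();

        SearchResultNode(String artifactType, String label) {
            this.artifactType = artifactType;
            this.label = label;
        }

        SearchResultNode addChild(SearchResultNode child) {
            children.add(child);
            return child;
        }
    }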

I think you might want to double the "Implementing text and file analyzer
for indexing" phase time.

Agreed, I will do that ; )

Regards,
Phillipe Ramalho



-- 
Phillipe Ramalho

Re: [GSoC 2009] Add search capability to index/search artifacts in the SCA domain

Posted by Adriano Crestani <ad...@apache.org>.
Hi Phillipe,

very good and detailed proposal : )

In addition, for every artifact the indexed artifact is related to, extra
information can be added using a Lucene feature called payloads; this
information could describe the relationship between the elements.

I like this relationship idea. Have you thought about extending the Lucene
query parser so new syntax could be provided? We could extend it and add
support for something like isreferenced("StoreCatalog"), so every component
that is referenced by StoreCatalog would be returned. Well, maybe we could
also do this using a Lucene field; it would be much faster. Anyway, there are
cool features that could be built using payloads, we just need to come up
with some good ideas : )

To handle different file types, file analyzers will be implemented to
extract the text from each of them. For example, a .class file is a binary
file, but the method names (mainly the ones annotated with SCA annotations)
could be extracted using the Java Reflection API. File analyzers could also
call other analyzers recursively; for example, a .composite file could be
analyzed using a CompositeAnalyzer, and when it reaches the
implementation.java node it could invoke a JavaClassAnalyzer, and so on. This
way each type of file will have only its significant text indexed; otherwise,
if the file is parsed with a plain text file analyzer, every search for
"component" would find every composite file, because it contains a
"<component>" node declaration.

This is really what I had in mind: something that extracts only the relevant
information, because search is also about good results; it is not as simple
as just finding matches, otherwise Google would not be so famous and you
probably would never be applying for GSoC : )... I think we should also
implement an analyzer for compressed files; there are many jars in a domain,
and we cannot just ignore them.

Now, about the "searching" session of your proposal, it's fine, I think
Lucene already give us a good query parser for user input. It's a good idea
to implement everything as an SCA component, and one of the services it
could provide is to search not only using a query text, but also accepting
Lucene query objects as input. Some app using the search component could
have a very user friendly interface where the user could check many
checkboxes and other high level GUI component to refine a query, for this
cases, when the app execute the search it would probably generate the Lucene
objects directly instead of creating a query string.

The results will be displayed using a tree layout, something like the
Eclipse IDE does [see image below] for its text search results, but instead
of a tree like project -> package -> class -> text fragment that contains the
searched text, it would be, for example, node -> contribution -> component ->
.composite file -> text fragment that contains the searched text. This is
just an example; the way the results are displayed can still be discussed on
the community mailing list.
Hey, this is a good way to display results, because in the results you can
already see the artifacts' relationships. Maybe we could work on expanding
the result tree down to files inside compressed files or methods inside class
files. I think this display model could be extended not only for displaying
results but also for displaying every artifact in the domain manager web app.

I think you might want to double the "Implementing text and file analyzer
for indexing" phase time.

+1 from me too :)

Adriano Crestani

