Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2003/09/04 13:18:23 UTC

Lucene app to index Java code

Hello,

Has anyone written an application that uses Lucene to index Java code,
either from the source .java files, or compiled .class files?

I need to create a searchable index for Java code, so that I can use
that index to check if classes or methods with certain functionality
have already been written.  This is an effort to remove code
duplication and do more code re-use.  If this application can also
index Javadocs, even better!

I think I heard of somebody doing this already.  Kevin Burton?
This is something that would fit nicely in Erik's Ant IndexTask (in the
Lucene Sandbox), I think.

Thank you,
Otis


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

Re: Lucene app to index Java code

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Otis Gospodnetic wrote:

>Hello,
>
>Has anyone written an application that uses Lucene to index Java code,
>either from the source .java files, or compiled .class files?
>
>I need to create a searchable index for Java code, so that I can use
>that index to check if classes or methods with certain functionality
>have already been written.  This is an effort to remove code
>duplication and do more code re-use.  If this application can also
>index Javadocs, even better!
>
>I think I heard of somebody doing this already.  Kevin Burton?
>  
>
I was playing with it... blogged about it here...

http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/

>This is something that would fit nicely in Erik's Ant IndexTask (in the
>Lucene Sandbox), I think.
>  
>
Yes... I was thinking about making an ant task for it or using someone 
else's.  One of the cool things would be direct integration within the IDE.

Also parsing the .java file into a token stream and then indexing the 
tokens would make a blazingly fast doc completion facility....

Kevin
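
As a rough illustration of the token-stream idea above (an editorial
sketch, not Kevin's code): the JDK's java.io.StreamTokenizer can split
Java-like source into word tokens with line numbers, which could then be
fed to an indexer. The class and method names here (JavaTokenSketch,
Tok, tokenize) are hypothetical, and a real tool would more likely use a
proper Java parser.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class JavaTokenSketch {

    /** One word token plus the line it was found on. */
    public static final class Tok {
        public final String text;
        public final int line;

        Tok(String text, int line) {
            this.text = text;
            this.line = line;
        }
    }

    /** Splits Java-like source into word tokens, skipping comments, numbers and string literals. */
    public static List<Tok> tokenize(String source) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.ordinaryChar('/');          // by default '/' starts a comment; make it ordinary first
        st.slashSlashComments(true);   // then skip // comments
        st.slashStarComments(true);    // and /* ... */ comments
        st.ordinaryChar('.');          // keep "a.b" as two words, not one token
        st.wordChars('_', '_');        // '_' and '$' are legal in Java identifiers
        st.wordChars('$', '$');
        List<Tok> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(new Tok(st.sval, st.lineno()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "public class A {\n  // getStudents in comment\n  int getStudents() { return 1; }\n}\n";
        for (Tok t : tokenize(src)) {
            System.out.println(t.line + ": " + t.text);
        }
    }
}
```

Keywords such as "public" and "class" come out as ordinary words here; a
completion or search tool would probably filter them before indexing.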

-- 
Help Support NewsMonster Development!  Purchase NewsMonster PRO!

    http://www.newsmonster.org/download-pro.html

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM - sfburtonator,  Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: Lucene app to index Java code

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Thursday, September 4, 2003, at 01:30  PM, Kevin A. Burton wrote:
>> - XDoclet could be used to sweep through Java code and build a 
>> text/XML file as richly as you'd like from the information there 
>> (complete with JavaDoc tags, which Zapata will miss :)), and then run 
>> Lucene on the generated files.  On a related note, the XDoclet2 
>> architecture would streamline this even further by eliminating the 
>> middle textual representation (QDox/XJavadoc reads Java as a "meta 
>> data provider" and then a Lucene "plugin" indexes things).  It could 
>> be done without the intermediate text representation even in XDoclet 
>> 1.2, but it would require coding a custom subtask and be slightly out 
>> of the norm for XDoclet subtasks (but would work just fine).
>
> It would be faster to write a native doclet as this would remove the 
> XML parse overhead...  The whole point of this thing is that it needs 
> to be fast!

Do you mean the Ant build file parsing?  That would be the only XML 
parsing in the equation I'm proposing, unless you did it the clunkiest 
XDoclet 1.2 way of having an intermediate XML file.

As for speed... QDox, I've heard, is the fastest option.  javadoc is 
the slowest parser of the three I know of (javadoc, xjavadoc, qdox).

	Erik


Re: Lucene app to index Java code

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Erik Hatcher wrote:

> A couple of thoughts on this:
>
> - Eclipse uses Lucene for its code indexing/searching (I learned this 
> at the OSCON Keynote by Eclipse folks).  Perhaps looking at how 
> Eclipse does its thing would be useful even if not the solution.
>
> - XDoclet could be used to sweep through Java code and build a 
> text/XML file as richly as you'd like from the information there 
> (complete with JavaDoc tags, which Zapata will miss :)), and then run 
> Lucene on the generated files.  On a related note, the XDoclet2 
> architecture would streamline this even further by eliminating the 
> middle textual representation (QDox/XJavadoc reads Java as a "meta 
> data provider" and then a Lucene "plugin" indexes things).  It could 
> be done without the intermediate text representation even in XDoclet 
> 1.2, but it would require coding a custom subtask and be slightly out 
> of the norm for XDoclet subtasks (but would work just fine).

It would be faster to write a native doclet as this would remove the XML 
parse overhead...  The whole point of this thing is that it needs to be 
fast!

Kevin




Re: StandardTokenizer problem

Posted by petite_abeille <pe...@mac.com>.
On Thursday, Sep 4, 2003, at 16:07 Europe/Zurich, Nicolas Maisonneuve 
wrote:

> "I.B.M" can be a host or an acronym, so there is a problem, no?

Perhaps as far as this parser goes... but... in practice... '.M' is not 
a valid TLD.

PA.
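
An editorial note on why the tokenizer reports HOST: JavaCC-generated
lexers pick the rule that matches the longest run of input, and fall
back to declaration order only on a tie. The ACRONYM rule quoted in the
original question requires a dot after every letter group, so on
"I.B.M" (no trailing dot) it can match only "I.B.", while HOST matches
all of "I.B.M" and wins. The sketch below mimics that choice with
java.util.regex. The two patterns are ASCII-only approximations of the
grammar rules, and the class and method names (TokenTypeSketch,
matchLength, classify) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenTypeSketch {

    // Approximation of: <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
    private static final Pattern ACRONYM =
            Pattern.compile("[A-Za-z]+\\.(?:[A-Za-z]+\\.)+");

    // Approximation of: <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
    private static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    /** Length of the prefix of text that the rule matches, or 0 if it does not match. */
    static int matchLength(Pattern rule, String text) {
        Matcher m = rule.matcher(text);
        return m.lookingAt() ? m.end() : 0;
    }

    /** Longest match wins; on a tie the rule declared first (ACRONYM) wins. */
    public static String classify(String text) {
        int acronym = matchLength(ACRONYM, text);
        int host = matchLength(HOST, text);
        if (acronym == 0 && host == 0) {
            return "NONE";
        }
        return acronym >= host ? "ACRONYM" : "HOST";
    }

    public static void main(String[] args) {
        System.out.println(classify("I.B.M"));   // prints HOST: ACRONYM covers 4 chars, HOST covers 5
        System.out.println(classify("I.B.M."));  // prints ACRONYM: ACRONYM covers 6 chars, HOST covers 5
    }
}
```

So writing the acronym with its trailing dot ("I.B.M.") should be enough
to get the ACRONYM type under this reading of the grammar.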


StandardTokenizer problem

Posted by Nicolas Maisonneuve <n....@HotPOP.com>.
Hi,

When I use StandardTokenizer to parse, for example, "I.B.M", the type
of the Token is HOST and not ACRONYM.

Why?

In StandardTokenizer.jj:

  // acronyms: U.S.A., I.B.M., etc.
  // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

  // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

"I.B.M" can be a host or an acronym, so there is a problem, no?




Re: Lucene app to index Java code

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Thursday, September 4, 2003, at 09:19  AM, petite_abeille wrote:
>> - XDoclet could be used to sweep through Java code and build a 
>> text/XML file as richly as you'd like from the information there 
>> (complete with JavaDoc tags, which Zapata will miss :)),
>
> Correct. This happens to be on purpose :) Does XDoclet build an 
> "intertwingled" object graph of your code along the way? Performing a 
> plain search on a code base is pretty trivial... what seems to be more 
> interesting would be to put that in context.

Yes, XDoclet builds a complete object graph of all the source files you 
hand it (as an Ant <fileset>).  It even interprets binary classes to 
get the information it needs for a full object graph when some 
dependencies are only on the classpath of the taskdef.

> Zapata does something along the lines of what MagicHat does for 
> Objective-C:
>
> http://homepage.mac.com/petite_abeille/MagicHat/

Very cool.  You rock!

	Erik


Re: Lucene app to index Java code

Posted by petite_abeille <pe...@mac.com>.
Hi Erik,

On Thursday, Sep 4, 2003, at 15:03 Europe/Zurich, Erik Hatcher wrote:

> - XDoclet could be used to sweep through Java code and build a 
> text/XML file as richly as you'd like from the information there 
> (complete with JavaDoc tags, which Zapata will miss :)),

Correct. This happens to be on purpose :) Does XDoclet build an 
"intertwingled" object graph of your code along the way? Performing a 
plain search on a code base is pretty trivial... what seems to be more 
interesting would be to put that in context.

Zapata does something along the lines of what MagicHat does for 
Objective-C:

http://homepage.mac.com/petite_abeille/MagicHat/

But from the sound of what Otis is saying this is not what you guys are 
looking for... back to the pampa then...

Cheers,

PA.


Re: Lucene app to index Java code

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
A couple of thoughts on this:

- Eclipse uses Lucene for its code indexing/searching (I learned this 
at the OSCON Keynote by Eclipse folks).  Perhaps looking at how Eclipse 
does its thing would be useful even if not the solution.

- XDoclet could be used to sweep through Java code and build a text/XML 
file as richly as you'd like from the information there (complete with 
JavaDoc tags, which Zapata will miss :)), and then run Lucene on the 
generated files.  On a related note, the XDoclet2 architecture would 
streamline this even further by eliminating the middle textual 
representation (QDox/XJavadoc reads Java as a "meta data provider" and 
then a Lucene "plugin" indexes things).  It could be done without the 
intermediate text representation even in XDoclet 1.2, but it would 
require coding a custom subtask and be slightly out of the norm for 
XDoclet subtasks (but would work just fine).

- My <index> task could be used, but it would be better to use 
something that built a complete object-graph of all the source code you 
want indexed, so that it can deal with base classes, inherited javadoc 
tags, and other such interactions between classes you might want to 
capture.

	Erik
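
To make the object-graph point concrete (an editorial sketch, not
Erik's code): an indexer that only sees one class at a time would miss
members a class inherits. The example below uses plain reflection on
compiled classes as a stand-in for the source-level model that XDoclet
or QDox would build, and walks the superclass chain so inherited
methods are listed under the subclass too. InheritedMethodsSketch, Base
and Student are hypothetical names.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class InheritedMethodsSketch {

    /** Hypothetical base class with a method that subclasses inherit. */
    public static class Base {
        public void save() { }
    }

    /** Hypothetical subclass that declares one method of its own. */
    public static class Student extends Base {
        public String getStudents() {
            return "";
        }
    }

    /** Returns "DeclaringClass#method" entries for type and each of its superclasses. */
    public static List<String> describe(Class<?> type) {
        List<String> entries = new ArrayList<>();
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            for (Method m : c.getDeclaredMethods()) {
                entries.add(c.getSimpleName() + "#" + m.getName());
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Lists Student's own method, then Base's, then everything declared on Object.
        for (String entry : describe(Student.class)) {
            System.out.println(entry);
        }
    }
}
```

Each entry could become one field value on the Lucene document for the
subclass, so a search for "save" would also find Student.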



Re: Lucene app to index Java code

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Otis Gospodnetic wrote:

>What you describe sounds interesting, but I was thinking more along the
>lines of this:
>
>http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/
>
>An application that I could use to find out whether I already have a
>'getStudents' or 'getStudents*' method somewhere in the source code,
>for instance, before I start writing it.  As the code base grows
>larger, and as the team that works with it becomes bigger, this tool
>becomes more and more valuable.
>If this application could also index Javadocs, so that I can search for
>methods or classes that mention +student* +(database OR db) +update,
>that would be even better.
>
>Has anyone done this?
>Kevin Burton mentioned something similar to what I described above, at
>that URL, but it looks like he didn't make his application available.
>  
>
It's just two source files plus Lucene, and I haven't done the work to 
turn it into an OSS package.  99% of OSS work isn't technical but 
political, maintenance, etc.

If someone wants to start an OSS project for this and do all the grunt 
work I will do the coding :)  I don't know what parser I want to use to 
tokenize the source but a Doclet would be perfect for this...  The only 
problem is that this wouldn't allow full differential builds and would 
slow down the generation...

Also it just dawned on me that the Emacs compile-internal function 
parses stdout in the form of file:line# so this would make a great way 
to integrate for us Emacs geeks.

Kevin
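
A small editorial sketch of the Emacs integration idea above: if the
search tool prints each hit as file:line followed by the matching text,
an editor that understands that shape can jump straight to the source.
The grep-style file:line:text layout is an assumption about what the
Emacs side would accept, and GrepStyleHitsSketch and hits are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GrepStyleHitsSketch {

    /** Returns one "file:line:text" entry for each line of the file that contains term. */
    public static List<String> hits(String fileName, List<String> lines, String term) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(term)) {
                // Line numbers are 1-based, as editors expect.
                out.add(fileName + ":" + (i + 1) + ":" + lines.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "class Roster {",
                "  List getStudents() { return null; }",
                "}");
        for (String hit : hits("Roster.java", lines, "getStudents")) {
            System.out.println(hit);
        }
    }
}
```

In a real tool the line numbers would come from the index rather than
from rescanning the file.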




Re: Lucene app to index Java code

Posted by Otis Gospodnetic <ot...@yahoo.com>.
What you describe sounds interesting, but I was thinking more along the
lines of this:

http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/

An application that I could use to find out whether I already have a
'getStudents' or 'getStudents*' method somewhere in the source code,
for instance, before I start writing it.  As the code base grows
larger, and as the team that works with it becomes bigger, this tool
becomes more and more valuable.
If this application could also index Javadocs, so that I can search for
methods or classes that mention +student* +(database OR db) +update,
that would be even better.

Has anyone done this?
Kevin Burton mentioned something similar to what I described above, at
that URL, but it looks like he didn't make his application available.

Thanks,
Otis
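
An editorial sketch of the 'getStudents*' lookup described above. It
deliberately uses an in-memory sorted map from the JDK instead of a
Lucene index, just to show what a prefix query over method names has to
do; in Lucene the same question would be a prefix query on a
method-name field. MethodNameIndexSketch, add and findByPrefix are
hypothetical names, and the file:line locations are made-up sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MethodNameIndexSketch {

    // Method name -> locations where a method of that name is declared.
    private final TreeMap<String, List<String>> byName = new TreeMap<>();

    /** Records that a method with this name is declared at the given location. */
    public void add(String methodName, String location) {
        byName.computeIfAbsent(methodName, k -> new ArrayList<>()).add(location);
    }

    /** Locations of all methods whose name starts with prefix, in name order. */
    public List<String> findByPrefix(String prefix) {
        List<String> hits = new ArrayList<>();
        // Keys in [prefix, prefix + '\uffff') are exactly the keys that start with
        // prefix, assuming no name contains the character '\uffff' itself.
        Map<String, List<String>> range =
                byName.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
        for (List<String> locations : range.values()) {
            hits.addAll(locations);
        }
        return hits;
    }

    public static void main(String[] args) {
        MethodNameIndexSketch index = new MethodNameIndexSketch();
        index.add("getStudents", "Roster.java:12");
        index.add("getStudentsByYear", "Roster.java:30");
        index.add("getTeachers", "Staff.java:8");
        System.out.println(index.findByPrefix("getStudents"));
    }
}
```

The Javadoc query (+student* +(database OR db) +update) needs real
full-text search over comment text, which is where Lucene itself would
come in.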


Re: Lucene app to index Java code

Posted by petite_abeille <pe...@mac.com>.
Hi Otis,

On Thursday, Sep 4, 2003, Otis Gospodnetic wrote:

> Has anyone written an application that uses Lucene to index Java code,
> either from the source .java files, or compiled .class files?

If you are talking about my ultra secret project "Zapata: Coding 
Mexican Style", then yes ;)

But... it uses runtime information to reach its devious ends and is 
more like a documentation tool than anything else...

Anyway, this is how it goes:

Given a set of binary jar files, it builds an object graph of the 
bytecode: packages, classes, methods and so on, complete with 
interdependencies and other handy information. The bytecode is also 
run through a decompiler and pretty-printed to normalize the source. 
Code segments are attached and indexed alongside their owners (class or 
method). All of this is fully indexed, searchable and cross-referenced.

This is built upon the same engine used by ZOE, so the end result is 
very much along the lines of what ZOE does for email, but for code 
instead... fun, fun, fun ;)

Cheers,

PA.