Posted to dev@lucy.apache.org by Kurt Starsinic <ks...@gmail.com> on 2016/04/14 18:53:54 UTC

[lucy-dev] Indexing source code

Hello Lucifers,

I want to use Lucy to index a bunch of source code (mostly Java, XML, Perl,
and C), and I haven't found any clear guidance in the docs. I'd much prefer
if the index were reasonably syntax-aware (at the very least, it should
distinguish a comment from not-a-comment, but I'd love to distinguish use
from mention).

Does anyone have a favored approach (for the general case of source code,
and/or the specific cases of the above-mentioned languages)? I'm not
expecting extended hand-holding; a pointer or two would be appreciated.
I'll be happy to formally document this, once I get it working.

- Kurt

Re: [lucy-dev] Indexing source code

Posted by Peter Karman <pe...@peknet.com>.

Marvin Humphrey wrote on 4/14/16, 9:06 PM:
> On Thu, Apr 14, 2016 at 9:53 AM, Kurt Starsinic <ks...@gmail.com> wrote:
>
>> I want to use Lucy to index a bunch of source code (mostly Java, XML, Perl,
>> and C), and I haven't found any clear guidance in the docs.
>
> The easy but not very powerful way is just to index source code as a bag of
> words, using a RegexTokenizer which matches `\w+`.  But that doesn't meet your
> needs...
>
>> I'd much prefer
>> if the index were reasonably syntax-aware (at the very least, it should
>> distinguish a comment from not-a-comment, but I'd love to distinguish use
>> from mention).
>
> So for that you're looking at some sort of lex/parse compiler front end for
> each language, which you then use to feed into different fields. You could
> potentially get quite fine grained.
>

If I were tackling this project, I would write a SWISH::Filter and use the Dezi 
system.

https://metacpan.org/pod/SWISH::Filter#WRITING-FILTERS

Basically, you would use a language-specific parser to convert everything to 
XML, which the Dezi system can parse natively.
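As a rough sketch of the "convert everything to XML" step, assuming a language-specific parser has already produced a mapping of field names to text: the snippet below is plain Python rather than a SWISH::Filter, and the `doc` root element and the field names are made up for illustration. How elements get mapped to index fields is not shown.

```python
import xml.etree.ElementTree as ET

def fields_to_xml(fields):
    """Serialize a {field_name: text} mapping as one small XML document."""
    doc = ET.Element("doc")  # hypothetical root element name
    for name, text in fields.items():
        # One child element per field; ElementTree escapes <, > and &.
        ET.SubElement(doc, name).text = text
    return ET.tostring(doc, encoding="unicode")

# Hypothetical fields, as a parser front end might emit them.
xml_doc = fields_to_xml({
    "class_name": "Widget",
    "comments": "Frobs the <T> widget",
})
print(xml_doc)
```

Markup characters inside the source text are escaped on output, so code containing `<` or `&` does not break the XML.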

It really all depends on the level of granularity you want for fields, and what
kind of tokenization you want -- e.g., is "foo()" a single term, or is it "foo"?
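To make the tokenization question concrete, here is a small sketch in plain Python (not Lucy or Dezi code) comparing two regex patterns on the same line of source. The second pattern is only one possible way to keep "foo()" together.

```python
import re

line = "result = foo() + bar"

# Option 1: identifiers only, so "foo()" is indexed as the term "foo".
identifiers = re.findall(r"\w+", line)

# Option 2: keep an immediately following "()" attached to the identifier,
# so "foo()" becomes its own term, distinct from a bare "foo".
with_parens = re.findall(r"\w+(?:\(\))?", line)

print(identifiers)
print(with_parens)
```

With the second pattern, a search for "foo" would no longer match the call "foo()" unless the query is tokenized the same way, which is the trade-off to decide up front.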

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [lucy-dev] Indexing source code

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Apr 14, 2016 at 9:53 AM, Kurt Starsinic <ks...@gmail.com> wrote:

> I want to use Lucy to index a bunch of source code (mostly Java, XML, Perl,
> and C), and I haven't found any clear guidance in the docs.

The easy but not very powerful way is just to index source code as a bag of
words, using a RegexTokenizer which matches `\w+`.  But that doesn't meet your
needs...
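For illustration, this is roughly what a `\w+` tokenizer sees. The sketch uses Python's `re` module rather than Lucy's RegexTokenizer, so it only demonstrates the pattern, not the Lucy API.

```python
import re

WORD = re.compile(r"\w+")  # same pattern as mentioned above

def bag_of_words(source):
    """Return every run of word characters, in source order."""
    return WORD.findall(source)

sample = "int count = get_count(); /* count items */"
tokens = bag_of_words(sample)
print(tokens)
```

The `count` in the code and the `count` in the comment come out as identical tokens, which is why this approach can't tell a comment from not-a-comment.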

> I'd much prefer
> if the index were reasonably syntax-aware (at the very least, it should
> distinguish a comment from not-a-comment, but I'd love to distinguish use
> from mention).

So for that you're looking at some sort of lex/parse compiler front end for
each language, which you then use to feed into different fields. You could
potentially get quite fine-grained, with fields such as:

* package names
* class names
* imports
* comments
* base/extends/implements
* function bodies
* return types
* file name
* url
* content [i.e. all content together]
* ...

Each field would be ordinary flat text.  (You might want to insert some fake
separator token in between function bodies to prevent spurious phrase
matching.)  Exactly how you get flat text out of a compiler front end is going
to be specific to the module.
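A minimal sketch of the idea for C-style source, in plain Python: a regex stands in for a real lexer (it does not understand comment markers inside string literals), and the field names and the `__BODY_BREAK__` separator token are hypothetical.

```python
import re

# Matches /* ... */ block comments and // line comments.
# Assumption: comment markers inside string literals are not handled;
# a real compiler front end would get this right.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

# Hypothetical separator; it is all word characters, so a `\w+`
# tokenizer keeps it as a single token between bodies.
SEPARATOR = " __BODY_BREAK__ "

def split_fields(source):
    """Split source into flat-text 'comments', 'code' and 'content' fields."""
    comments = [m.group(0) for m in COMMENT.finditer(source)]
    code = COMMENT.sub(" ", source)  # code with comments blanked out
    return {"comments": "\n".join(comments), "code": code, "content": source}

def join_bodies(bodies):
    """Join function bodies so phrases cannot match across a boundary."""
    return SEPARATOR.join(bodies)

source = "/* add two ints */\nint add(int a, int b) { return a + b; } // trivial\n"
fields = split_fields(source)
bodies_text = join_bodies(["return a;", "return b;"])
```

Each value in `fields` is ordinary flat text that could be handed to the indexer as a separate field, with `content` kept whole for excerpting.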

For parsing Perl source code, you presumably want PPI.  For XML, choose your
favorite XML module.  For Java/C, I don't know -- perhaps someone else has a
suggestion.

The next phase is designing a decent query interface.  Searching all fields
with default weighting is unlikely to yield optimum results, so you'll have to
tune it like you would any other search app.  Your users are probably
sophisticated and will also appreciate an "advanced" interface.
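As a toy illustration of why per-field weighting matters, here is a plain-Python sketch that scores a document by counting term hits per field and multiplying by a weight. The field names and weights are invented, and this is not how Lucy computes scores; it only shows the shape of the tuning problem.

```python
import re

# Hypothetical per-field weights: a hit in a class name counts for more
# than a hit buried in a function body.
FIELD_WEIGHTS = {"class_names": 3.0, "comments": 1.0, "function_bodies": 0.5}

def score(doc, term):
    """Sum weighted exact-token hits for `term` across the weighted fields."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        hits = re.findall(r"\w+", doc.get(field, "")).count(term)
        total += weight * hits
    return total

doc = {
    "class_names": "Parser",
    "comments": "Parser for config files",
    "function_bodies": "return new Parser(path)",
}
parser_score = score(doc, "Parser")
print(parser_score)
```

With equal weights all three hits would count the same, so a file that merely uses `Parser` could rank alongside the file that defines it.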

Finally, you'll want excerpting.  That's what you need the `content` field
for.  Hopefully Lucy's Highlighter will choose good excerpts out of the box.

Add a link from the URL field, and there you go!

> I'll be happy to formally document this, once I get it working.

It would be cool to get some sort of markdown document for the Lucy Cookbook,
similar to these!

https://github.com/apache/lucy/tree/apache-lucy-0.5.0/core/Lucy/Docs/Cookbook

Marvin Humphrey