Posted to java-user@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2003/06/30 12:46:01 UTC

Re: Lucene crawler plan

There's an experimental webcrawler in the lucene-sandbox area called
larm-webcrawler (see
http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),

and a project on Sourceforge (http://larm.sf.net) that tries to leverage
this on a higher level. I want to encourage you to go to that site and read
through the specs in Sourceforge's CVS.

It covers pretty much everything that Andy wrote in his proposal, and
more. The project only contains conceptual documents at this time, but if
you're willing to contribute actively, that would be very much appreciated.

Unfortunately I have to stop my efforts regarding LARM. Long story short: My
future employer says it's too close to their business. But in contrast to
other open source projects, there's already lots of ideas in that document
and lots of code in the old crawler. If you wish to contribute, it's now up
to you.

Clemens



----- Original Message ----- 
From: "Andrew C. Oliver" <ac...@apache.org>
To: "Peter Becker" <pb...@dstc.edu.au>
Cc: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Friday, June 27, 2003 2:53 AM
Subject: Re: Lucene crawler plan


> On 6/26/03 8:33 PM, "Peter Becker" <pb...@dstc.edu.au> wrote:
>
> > Hi Andrew,
> >
> > are you the Andy signing this:
> > http://jakarta.apache.org/lucene/docs/luceneplan.html? If no -- do you
> > know who wrote the page and could you forward this email? Thanks. BTW:
> > your website link on http://jakarta.apache.org/lucene/docs/whoweare.html
> > is dead.
> >
>
> Yes I wrote it.
>
> >
> > The question is: is there some code already? If yes: can we get it? Can
> > we join the effort? If no: what are things we should consider doing to
> > increase our chances that you guys accept our code in the end? We are
> > not really interested in maintaining the crawler bits and pieces, our
> > interest is in the visualization. We are happy to get something going as
> > part of our little demonstrator, but then we'd give it to you and hope
> > someone picks up maintenance.
> >
>
> I never wrote any code, but there is code in lucene-contrib which realized
> most of what is in this document.  I was going to write code, but someone
> beat me to the punch and I was like "wow I have things I can do that
> others won't do for me" and moved on :-)
>
> I'm cc'ing lucene developers list.  You'll find plenty of folks interested
> in working with you on this.
>
> -Andy
> > Is this all an option anyway? It is ok to say no ;-)
> >
> > Regards,
> >  Peter
> >
>
> -- 
> Andrew C. Oliver
> http://www.superlinksoftware.com/poi.jsp
> Custom enhancements and Commercial Implementation for Jakarta POI
>
> http://jakarta.apache.org/poi
> For Java and Excel, Got POI?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Victor Hadianto <vi...@nuix.com.au>.
> >does anyone know of a Java implementation for file(1) magic?
>
> Peter,
> Can you explain what file(1) magic is?

From "man 1 file":

File tests each argument in an attempt to classify it.  There are three sets 
of tests, performed in this order: filesystem tests, magic number tests, and 
language tests.  The first test that succeeds causes the file type to be 
printed.

victor


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Peter Becker <pb...@dstc.edu.au>.
Peter Becker wrote:

[...about the UNIX "file" command...]

> The idea is to recognize files by certain parts in them instead of 
> using the extensions. The result of the classic file command is a 
> user-readable string, although there have been extensions to MIME 
> types. Unfortunately I can't find a pointer for the latter. 

I should read the documents I cite :-) The man page for file contains 
the information about the MIME versions -- it is the "-i" option (see 
OPTIONS, or search for " -i ").

   Peter



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Peter Becker <pb...@dstc.edu.au>.
Sorry Jack,

after I sent my mail I realized that many Unix users don't know 
that command, and even fewer people from other platforms do. Here are the 
relevant UNIX man pages:

  http://unixhelp.ed.ac.uk/CGI/man-cgi?file
  http://unixhelp.ed.ac.uk/CGI/man-cgi?magic+5

The idea is to recognize files by certain parts in them instead of using 
the extensions. The result of the classic file command is a 
user-readable string, although there have been extensions to MIME types. 
Unfortunately I can't find a pointer for the latter.
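
For a rough idea of what this would look like in Java, here is a minimal 
sketch -- the signature table is a tiny illustration, nothing like the 
full magic(5) database:

    import java.io.FileInputStream;
    import java.io.IOException;

    /**
     * file(1)-style detection: read the first few bytes and compare
     * them against known signatures instead of trusting the extension.
     */
    public class MagicSniffer {
        private static final byte[][] MAGIC = {
            { '%', 'P', 'D', 'F' },                                  // PDF
            { (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0 },  // OLE2 (.doc/.xls)
            { 'P', 'K', 3, 4 },                                      // ZIP
        };
        private static final String[] TYPE = {
            "application/pdf", "application/msword", "application/zip"
        };

        public static String guessType(String fileName) throws IOException {
            byte[] head = new byte[8];
            FileInputStream in = new FileInputStream(fileName);
            try {
                in.read(head); // a short read leaves zero bytes, which won't match
            } finally {
                in.close();
            }
            for (int i = 0; i < MAGIC.length; i++) {
                if (startsWith(head, MAGIC[i])) {
                    return TYPE[i];
                }
            }
            return "application/octet-stream"; // nothing matched
        }

        private static boolean startsWith(byte[] data, byte[] prefix) {
            for (int i = 0; i < prefix.length; i++) {
                if (data[i] != prefix[i]) {
                    return false;
                }
            }
            return true;
        }
    }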

  Peter



Jack Park wrote:

> At 07:21 PM 6/30/2003, you wrote:
>
>> does anyone know of a Java implementation for file(1) magic?
>
>
> Peter,
> Can you explain what file(1) magic is?
> I feel dense. I'd like to help if I can.
> Thanks
> Jack
>
>
> ---------------------------------------------------------------------------
> XML Topic Maps: Creating and Using Topic Maps for the Web.
> Addison-Wesley. Jack Park, Editor. Sam Hunting, Technical Editor
>
> Build smarter kids, not smarter bombs.
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Jack Park <ja...@thinkalong.com>.
At 07:21 PM 6/30/2003, you wrote:
>does anyone know of a Java implementation for file(1) magic?

Peter,
Can you explain what file(1) magic is?
I feel dense. I'd like to help if I can.
Thanks
Jack


---------------------------------------------------------------------------
XML Topic Maps: Creating and Using Topic Maps for the Web.
Addison-Wesley. Jack Park, Editor. Sam Hunting, Technical Editor

Build smarter kids, not smarter bombs.



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Parser Question

Posted by Peter Becker <pb...@dstc.edu.au>.
Tod Thomas wrote:

>Peter Becker wrote:
>
>  
>
>>Leo Galambos wrote:
>>
>>    
>>
>>>Peter Becker wrote:
>>>
>>>      
>>>
>>>>Hi Tod,
>>>>
>>>>as far as I know Lucene itself doesn't offer this (at least we failed
>>>>to find it). The closest thing available seems to be the Ant tasks.
>>>>
>>>>We are currently working on introducing this notion for our program,
>>>>which is open source. Besides the plugin mechanism there will be a
>>>>file filter mapping and a thread mechanism to maintain an index as
>>>>well as implementations using POI and Multivalent. Give us another
>>>>week or two.
>>>>        
>>>>
>>>Unfortunately, I didn't get this. Could you explain the mechanism,
>>>please? Thank you
>>>      
>>>
>>Not fully yet, since we are still working on it ;-) You can find the
>>code in our CVS repository on Sourceforge:
>>
>>
>>http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/
>>
>>The idea is that you have to supply different parsers for different
>>formats, then turn the results found into Lucene Document objects. At
>>the moment we do this using a normal interface similar to the one used
>>in the Java Ant tasks (see the "handlers" directory), but we want to
>>turn it into a plugin interface. Our tool should in the end do TXT, HTML
>>and XML out of the box and have at least three plugin implementations:
>>
>>  - POI for .doc, .xls
>>  - PDFbox for .pdf
>>  - Multivalent for .pdf, .dvi and others
>>
>>The plugin API will be extremely simple and it should fit easily with
>>the Ant tasks, so you should be able to wrap our code into an Ant task
>>or whatever interface you need.
>>    
>>
>
>This sounds really cool.  If I'm reading you correctly it will be a fairly
>straightforward exercise to port parsers written in Java for existing file
>formats to use your plugin architecture.  Accurate?
>  
>
The basic interface just takes a URL and returns a data object 
containing the different types of information: body, title, authors, and 
so on. The data object returned is where we differ from the Ant task, 
where a Lucene Document is created -- the main reason we did it 
differently is that we didn't want to decide the way of indexing (mostly 
stored vs. unstored) in the parser code. But it also gives the 
option to use the parsers for something other than Lucene.
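
In rough outline the shape described above might look like this -- the 
names are simplified stand-ins rather than the actual CVS source, though 
the two method signatures match the ones mentioned elsewhere in this 
thread:

    import java.net.URL;
    import java.util.Properties;

    /** Sketch of the parser-side interface described above. */
    public interface DocumentProcessor {
        /** Parse the document behind the URL into a plain data object. */
        DocumentSummary processDocument(URL url) throws Exception;

        /** Name shown in the UI when listing filter-to-processor mappings. */
        String getDisplayName();
    }

    /** Plain data carrier -- no Lucene types, so indexing stays a separate concern. */
    class DocumentSummary {
        public String title;
        public String body;
        public String authors;
        public Properties extra = new Properties(); // open-ended extension point
    }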

There will be some overhead for the plugin management, which we haven't 
done yet. Hopefully it will be very simple to use -- the idea here is to 
keep the complexity (if needed) in the plugin manager we will provide. 
We aim to get this done very soon.

More info when it is done :-) If you are too curious just read the code ;-)

  Peter


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Parser Question

Posted by Tod Thomas <tt...@chubb.com>.
Peter Becker wrote:

> Leo Galambos wrote:
>
> > Peter Becker wrote:
> >
> >> Hi Tod,
> >>
> >> as far as I know Lucene itself doesn't offer this (at least we failed
> >> to find it). The closest thing available seems to be the Ant tasks.
> >>
> >> We are currently working on introducing this notion for our program,
> >> which is open source. Besides the plugin mechanism there will be a
> >> file filter mapping and a thread mechanism to maintain an index as
> >> well as implementations using POI and Multivalent. Give us another
> >> week or two.
> >
> >
> > Unfortunately, I didn't get this. Could you explain the mechanism,
> > please? Thank you
>
> Not fully yet, since we are still working on it ;-) You can find the
> code in our CVS repository on Sourceforge:
>
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/
>
> The idea is that you have to supply different parsers for different
> formats, then turn the results found into Lucene Document objects. At
> the moment we do this using a normal interface similar to the one used
> in the Java Ant tasks (see the "handlers" directory), but we want to
> turn it into a plugin interface. Our tool should in the end do TXT, HTML
> and XML out of the box and have at least three plugin implementations:
>
>   - POI for .doc, .xls
>   - PDFbox for .pdf
>   - Multivalent for .pdf, .dvi and others
>
> The plugin API will be extremely simple and it should fit easily with
> the Ant tasks, so you should be able to wrap our code into an Ant task
> or whatever interface you need.

This sounds really cool.  If I'm reading you correctly it will be a fairly
straightforward exercise to port parsers written in Java for existing file
formats to use your plugin architecture.  Accurate?

Tod


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Parser Question

Posted by Peter Becker <pb...@dstc.edu.au>.
Leo Galambos wrote:

> Peter Becker wrote:
>
>> Hi Tod,
>>
>> as far as I know Lucene itself doesn't offer this (at least we failed 
>> to find it). The closest thing available seems to be the Ant tasks.
>>
>> We are currently working on introducing this notion for our program, 
>> which is open source. Besides the plugin mechanism there will be a 
>> file filter mapping and a thread mechanism to maintain an index as 
>> well as implementations using POI and Multivalent. Give us another 
>> week or two.
>
>
> Unfortunately, I didn't get this. Could you explain the mechanism, 
> please? Thank you 

Not fully yet, since we are still working on it ;-) You can find the 
code in our CVS repository on Sourceforge:

  
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/

The idea is that you have to supply different parsers for different 
formats, then turn the results found into Lucene Document objects. At 
the moment we do this using a normal interface similar to the one used 
in the Java Ant tasks (see the "handlers" directory), but we want to 
turn it into a plugin interface. Our tool should in the end do TXT, HTML 
and XML out of the box and have at least three plugin implementations:

  - POI for .doc, .xls
  - PDFbox for .pdf
  - Multivalent for .pdf, .dvi and others

The plugin API will be extremely simple and it should fit easily with 
the Ant tasks, so you should be able to wrap our code into an Ant task 
or whatever interface you need.

  Peter



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Parser Question

Posted by Leo Galambos <Le...@seznam.cz>.
Peter Becker wrote:

> Hi Tod,
>
> as far as I know Lucene itself doesn't offer this (at least we failed 
> to find it). The closest thing available seems to be the Ant tasks.
>
> We are currently working on introducing this notion for our program, 
> which is open source. Besides the plugin mechanism there will be a file 
> filter mapping and a thread mechanism to maintain an index as well as 
> implementations using POI and Multivalent. Give us another week or two.

Unfortunately, I didn't get this. Could you explain the mechanism, 
please? Thank you

-g-


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Parser Question

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi Tod,

as far as I know Lucene itself doesn't offer this (at least we failed to 
find it). The closest thing available seems to be the Ant tasks.

We are currently working on introducing this notion for our program, 
which is open source. Besides the plugin mechanism there will be a file 
filter mapping and a thread mechanism to maintain an index as well as 
implementations using POI and Multivalent. Give us another week or two.

BTW: has anyone looked into the option of using the OpenOffice UDK 
(http://udk.openoffice.org/) as a document parser? We wanted to, but I am 
afraid we won't have the time. It sure will be a huge plugin and not as 
easy to deploy as the average JAR, but it would support a large range of 
documents and should be well suited for enterprise document collections.

  Peter



Tod Thomas wrote:

>I noticed from the FAQ that the developer must provide a parser for every
>type of document that requires indexing by Lucene.  Does Lucene have a
>'plugin' capability to easily add a new parser into the mix?
>
>Forgive me if this is a dumb question, I haven't yet looked at the source
>code, or the configuration in detail.
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>  
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Parser Question

Posted by Tod Thomas <tt...@chubb.com>.
I noticed from the FAQ that the developer must provide a parser for every
type of document that requires indexing by Lucene.  Does Lucene have a
'plugin' capability to easily add a new parser into the mix?

Forgive me if this is a dumb question, I haven't yet looked at the source
code, or the configuration in detail.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Peter Becker <pb...@dstc.edu.au>.
Erik Hatcher wrote:

[...some Ant related things I should look at...]

>> What are the issues with JTidy?
>
>
> The version number!  It's ancient.  It does a decent job with even 
> mangled HTML though - I just suspect something better is surely out 
> there by now.

My colleague had the same thought, but I think that is not a problem. 
The HTML 4.01 recommendation is from Christmas 1999. I don't really see 
any reason why they should have changed it once it worked well enough. 
Of course website programmers might have come up with other forms of 
weirdness in the code by now, but I can easily imagine that this is not 
a problem if the original parser was robust enough.

  Peter


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Tuesday, July 1, 2003, at 06:36  PM, Peter Becker wrote:
>> Ah, but Ant *does* have more sophisticated filtering mechanisms!  :)  
>> The <fileset>'s that the <index> task can take can leverage any of 
>> Ant's built-in capabilities, such as (new in Ant 1.5) Selector 
>> capability.  So you could easily filter on file size, file date, etc, 
>> and custom Selectors can be written and plugged in.
>
> Ant does. What I meant by the Ant project was the code in the Lucene 
> CVS for Ant. The decision between the two DocumentHandlers seems to be 
> made based on the extension. But maybe I didn't read the code properly.

But, look at the setters on IndexTask.  The document handler is 
pluggable.  The one that is provided is definitely dumb, no question, 
and was only meant as an example.  I have my own BlogDocumentHandler 
for indexing my blog entries, for example (they are text files, but get 
indexed differently than plain ol' .txt).
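
For illustration, a custom handler might look roughly like this; the 
getDocument signature follows the sandbox 'ant' code as far as I recall 
(verify against CVS), and the field choices are just an example:

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.ant.DocumentHandler;
    import org.apache.lucene.ant.DocumentHandlerException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    /** Example custom handler in the spirit of the BlogDocumentHandler above. */
    public class NoteDocumentHandler implements DocumentHandler {
        public Document getDocument(File file) throws DocumentHandlerException {
            try {
                Document doc = new Document();
                doc.add(Field.Keyword("filename", file.getName()));    // stored, untokenized
                doc.add(Field.Text("contents", new FileReader(file))); // tokenized, unstored
                return doc;
            } catch (IOException e) {
                throw new DocumentHandlerException(e.getMessage());
            }
        }
    }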

> What I want to see is a user-defined mapping from some kinds of 
> FileFilters (extension, wildcard, regexp, magic numbers, whatever) to 
> the DocumentHandlers. They would be applied in order, and whenever one 
> hits, the iteration stops unless the DocumentHandler throws an 
> exception. Additional DocumentHandlers could be mixed in to 
> provide extra information. I am thinking of file system information 
> and metadata stores here. These would be an independent dimension of 
> data about the documents.

Also note that the code could easily be modified to allow dynamic 
properties to be passed to document handlers (see Ant's 
DynamicConfigurator interface).  I experimented with this some myself, 
but didn't need it so didn't keep the code around.

>> I think there are probably some better options out there than using 
>> JTidy these days, but I have not had time to investigate them.  JTidy 
>> does the job reasonably well though.
>
> We are looking into some alternatives. We have a few tens of thousands 
> of documents to test on :-) I suspect we will just implement whatever 
> comes along and let them run, collecting exceptions and time consumed. 
> Checking whether they really got all the interesting content will be 
> too much work, though.
>
> What are the issues with JTidy?

The version number!  It's ancient.  It does a decent job with even 
mangled HTML though - I just suspect something better is surely out 
there by now.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Peter Becker <pb...@dstc.edu.au>.
Erik Hatcher wrote:

> On Monday, June 30, 2003, at 10:21  PM, Peter Becker wrote:
>
>> this is far closer to what we are looking for. Using Ant is an 
>> interesting idea, although it probably won't help us for the UI tool. 
>> But we could try to layer things so we could use them for both
>
>
> Yes, I'm sure a more generalized method could be developed that 
> accommodates both.  It's pretty decoupled even within the Ant project 
> with a DocumentHandler interface and all. 

And frankly -- these little code pieces are easy to port. The trick is 
knowing which library to use and how.

>> Two differences between the Ant project and what we do right now:
>> - the Ant project doesn't have a notion of an explicit file filter. I 
>> think this is important if you want to extend the filter options to 
>> more than just extensions and if you want some UI to manage the 
>> filter mappings. BTW: does anyone know of a Java implementation for 
>> file(1) magic?
>
>
> Ah, but Ant *does* have more sophisticated filtering mechanisms!  :)  
> The <fileset>'s that the <index> task can take can leverage any of 
> Ant's built-in capabilities, such as (new in Ant 1.5) Selector 
> capability.  So you could easily filter on file size, file date, etc, 
> and custom Selectors can be written and plugged in. 

Ant does. What I meant by the Ant project was the code in the Lucene 
CVS for Ant. The decision between the two DocumentHandlers seems to be 
made based on the extension. But maybe I didn't read the code properly.

What I want to see is a user-defined mapping from some kinds of 
FileFilters (extension, wildcard, regexp, magic numbers, whatever) to 
the DocumentHandlers. They would be applied in order, and whenever one 
hits, the iteration stops unless the DocumentHandler throws an 
exception. Additional DocumentHandlers could be mixed in to provide 
extra information. I am thinking of file system information and 
metadata stores here. These would be an independent dimension of data 
about the documents.
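
A minimal sketch of that dispatch -- the nested Handler interface is a 
hypothetical stand-in in the style of the Ant code:

    import java.io.File;
    import java.io.FileFilter;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.document.Document;

    /** First-match-wins mapping from FileFilters to document handlers. */
    public class HandlerMapping {

        /** Hypothetical handler interface, in the style of the Ant code. */
        public interface Handler {
            Document getDocument(File file) throws Exception;
        }

        private final List filters = new ArrayList();  // FileFilter, in priority order
        private final List handlers = new ArrayList(); // Handler, parallel to filters

        public void register(FileFilter filter, Handler handler) {
            filters.add(filter);
            handlers.add(handler);
        }

        /** Apply filters in order; stop at the first hit unless its handler throws. */
        public Document process(File file) {
            for (int i = 0; i < filters.size(); i++) {
                if (!((FileFilter) filters.get(i)).accept(file)) {
                    continue;
                }
                try {
                    return ((Handler) handlers.get(i)).getDocument(file);
                } catch (Exception e) {
                    // this handler failed; keep going to the next matching filter
                }
            }
            return null; // no filter matched, or all matching handlers failed
        }
    }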

>> - the code creates Documents as return values. The reason we went 
>> away from this is that we want to use the same document handler with 
>> different index options. One of the core issues here is storing the 
>> body or not. I don't think there is any true answer for this one, so 
>> it should be configurable somehow.
>
>
> Agreed.  It was a toss-up when I went to implement as who is actually 
> in control of the Document instantiation and population.
>
>>  The two options I see are either returning a data object and then 
>> turning that into a Document somewhere else or passing some 
>> configuration object around. Neither is really nice: the first one 
>> needs to create an additional object all the time, while the second 
>> one puts quite some burden on the implementer of the document 
>> handler. Ideas on that one would be extremely welcome.
>
>
> If you invert what I have done then the "controller" needs to know 
> more information about the fields, more than you could convey in a 
> String/String Map - is a field indexed or not?  Is a field tokenized 
> or not?  Is it stored or not?  Who decides on the field names?  Who 
> decides all of these?  These are the questions we have to answer to do 
> this type of stuff. 

Exactly. Somehow these issues should be separated from the issue of 
finding the data. Our current idea is to collect everything in a data 
object and then get some other code to turn it into a Lucene Document. 
Another version would be a wrapper/factory/strategy around the Lucene 
Document doing the mapping.

The field name question would be separated this way, but one question 
would be left: what are the fields? The idea of having the extra 
Properties field doesn't really help that much, since then we are back 
to where we started. Giving a big range of default fields (along the 
lines of Dublin Core?) would help, but would be overkill. It could be 
expensive in terms of object creation, too -- the wrapper approach would 
probably be better here.
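
As a sketch of the wrapper/strategy idea, reusing the DocumentSummary 
shape sketched earlier on this page -- Field.Text, Field.UnStored and 
Field.Keyword are the factory methods Lucene ships, and the stored-or-not 
decision becomes a constructor flag instead of living in the parser:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    /** Strategy that turns a parser's data object into a Lucene Document. */
    public class SummaryToDocumentMapper {
        private final boolean storeBody;

        public SummaryToDocumentMapper(boolean storeBody) {
            this.storeBody = storeBody;
        }

        public Document toDocument(DocumentSummary summary) {
            Document doc = new Document();
            doc.add(Field.Text("title", summary.title));        // stored + tokenized
            doc.add(storeBody
                    ? Field.Text("body", summary.body)          // stored + tokenized
                    : Field.UnStored("body", summary.body));    // tokenized only
            doc.add(Field.Keyword("authors", summary.authors)); // stored, untokenized
            return doc;
        }
    }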

>> Two ideas we will probably pick up from this are:
>> - use Ant for creating indexes if we go larger than personal document 
>> retrieval
>
>
> Keep in mind you could also launch Ant via the API from a GUI as well, 
> or just leverage the IndexTask itself and call it via the API and its 
> execute() method. 

I'll investigate this. Thanks.

>> - use JTidy for HTML parsing (we missed that one and used Swing 
>> instead, which is no good)
>
>
> I think there are probably some better options out there than using 
> JTidy these days, but I have not had time to investigate them.  JTidy 
> does the job reasonably well though. 

We are looking into some alternatives. We have a few tens of thousands of 
documents to test on :-) I suspect we will just implement whatever comes 
along and let them run, collecting exceptions and time consumed. Checking 
whether they really got all the interesting content will be too much work, 
though.

What are the issues with JTidy?

  Peter


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Monday, June 30, 2003, at 10:21  PM, Peter Becker wrote:
> this is far closer to what we are looking for. Using Ant is an 
> interesting idea, although it probably won't help us for the UI tool. 
> But we could try to layer things so we could use them for both

Yes, I'm sure a more generalized method could be developed that 
accommodates both.  It's pretty decoupled even within the Ant project 
with a DocumentHandler interface and all.

> Two differences between the Ant project and what we do right now:
> - the Ant project doesn't have a notion of an explicit file filter. I 
> think this is important if you want to extend the filter options to 
> more than just extensions and if you want some UI to manage the filter 
> mappings. BTW: does anyone know of a Java implementation for file(1) 
> magic?

Ah, but Ant *does* have more sophisticated filtering mechanisms!  :)  
The <fileset>'s that the <index> task can take can leverage any of 
Ant's built-in capabilities, such as (new in Ant 1.5) Selector 
capability.  So you could easily filter on file size, file date, etc, 
and custom Selectors can be written and plugged in.

> - the code creates Documents as return values. The reason we went away 
> from this is that we want to use the same document handler with 
> different index options. One of the core issues here is storing the 
> body or not. I don't think there is any true answer for this one, so 
> it should be configurable somehow.

Agreed.  It was a toss-up when I went to implement as who is actually 
in control of the Document instantiation and population.

>  The two options I see are either returning a data object and then 
> turning that into a Document somewhere else or passing some 
> configuration object around. Neither is really nice: the first one 
> needs to create an additional object all the time, while the second 
> one puts quite some burden on the implementer of the document handler. 
> Ideas on that one would be extremely welcome.

If you invert what I have done then the "controller" needs to know more 
information about the fields, more than you could convey in a 
String/String Map - is a field indexed or not?  Is a field tokenized or 
not?  Is it stored or not?  Who decides on the field names?  Who 
decides all of these?  These are the questions we have to answer to do 
this type of stuff.

> Two ideas we will probably pick up from this are:
> - use Ant for creating indexes if we go larger than personal document 
> retrieval

Keep in mind you could also launch Ant via the API from a GUI as well, 
or just leverage the IndexTask itself and call it via the API and its 
execute() method.
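
Something along these lines -- the setProject plumbing is required for 
any Ant task run outside a build file, while the commented-out setters 
are assumptions to check against the sandbox source:

    import org.apache.lucene.ant.IndexTask;
    import org.apache.tools.ant.Project;

    public class EmbeddedIndexer {
        public static void main(String[] args) {
            IndexTask task = new IndexTask();
            task.setProject(new Project()); // Ant tasks need a Project to log against
            // Mirror whatever attributes you would set in build.xml, e.g.:
            // task.setIndex(new java.io.File("/tmp/index"));
            // task.addFileset(...);
            task.execute(); // runs the same code path the <index> task uses
        }
    }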

> - use JTidy for HTML parsing (we missed that one and used Swing 
> instead, which is no good)

I think there are probably some better options out there than using 
JTidy these days, but I have not had time to investigate them.  JTidy 
does the job reasonably well though.

> So thanks again, that was quite helpful.

My pleasure!

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Peter Becker <pb...@dstc.edu.au>.
Thanks Erik,

this is far closer to what we are looking for. Using Ant is an 
interesting idea, although it probably won't help us for the UI tool. 
But we could try to layer things so we could use them for both -- we 
want to get some more sophisticated index management anyway. The option 
to create the index at one place and use it somewhere else would be 
great -- during testing and demoing we ran into the problem that we 
wanted to demo on a Windows box but using a Unix filesystem mounted via 
SMB/Samba. Symlinks are no fun in this case :-( To work around this we 
need to develop some notion of a base URL, then we could easily mount an 
index created on one machine on another -- even if the underlying OS 
changes. To go Enterprise we would still need some security concept, 
which we probably won't do before someone is willing to pay for it :-) 
It might be better to go intranet for that one anyway -- we should be 
able to take it all to the Web.

Two differences between the Ant project and what we do right now:
- the Ant project doesn't have a notion of an explicit file filter. I 
think this is important if you want to extend the filter options to more 
than just extensions and if you want some UI to manage the filter 
mappings. BTW: does anyone know of a Java implementation for file(1) magic?
- the code creates Documents as return values. The reason we went away 
from this is that we want to use the same document handler with 
different index options. One of the core issues here is storing the body 
or not. I don't think there is any true answer for this one, so it 
should be configurable somehow. The two options I see are either 
returning a data object and then turning that into a Document somewhere 
else or passing some configuration object around. Neither is really 
nice: the first one needs to create an additional object all the time, 
while the second one puts quite some burden on the implementer of the 
document handler. Ideas on that one would be extremely welcome.

Two ideas we will probably pick up from this are:
- use Ant for creating indexes if we go larger than personal document 
retrieval
- use JTidy for HTML parsing (we missed that one and used Swing instead, 
which is no good)
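
For the record, a minimal JTidy sketch: Tidy and parseDOM are the real 
entry points, while the text-gathering walk is our own addition:

    import java.io.FileInputStream;
    import org.w3c.dom.Node;
    import org.w3c.tidy.Tidy;

    public class TidyTextExtractor {
        public static void main(String[] args) throws Exception {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);         // suppress per-file chatter
            tidy.setShowWarnings(false); // mangled HTML produces piles of these
            org.w3c.dom.Document dom =
                    tidy.parseDOM(new FileInputStream(args[0]), null);
            StringBuffer text = new StringBuffer();
            collectText(dom, text);
            System.out.println(text);
        }

        /** Depth-first walk gathering all text nodes. */
        private static void collectText(Node node, StringBuffer out) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                out.append(node.getNodeValue()).append(' ');
            }
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                collectText(child, out);
            }
        }
    }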

So thanks again, that was quite helpful.

  Peter



Erik Hatcher wrote:

> If you are after a pure file system indexing abstraction, check out 
> the  'ant' project in the sandbox.  It's got a DocumentHandler 
> abstraction  allowing it to be a bit pluggable.  It's not perfect, but 
> it has worked  for me for quite some time quite sufficiently.
>
>     Erik
>
>
> On Monday, June 30, 2003, at 08:26  PM, Peter Becker wrote:
>
>> Clemens Marschner wrote:
>>
>>> There's an experimental webcrawler in the lucene-sandbox area called
>>> larm-webcrawler (see
>>> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),
>>>
>>> and a project on Sourceforge (http://larm.sf.net) that tries to  
>>> leverage
>>> this on a higher level. I want to encourage you to go to that site  
>>> and read
>>> through the specs in Sourceforge's CVS.
>>>
>> I've done that by now -- my first problem was to identify LARM as 
>> the  relevant project, but then things were reasonably easy to find.
>>
>>> It covers pretty much everything that Andy wrote in his 
>>> proposal,  and
>>> more. The project only contains conceptual documents at this time,  
>>> but if
>>> you're willing to contribute actively, that would be very much 
>>> appreciated.
>>>
>> In many ways the project aims too high for us. We are interested 
>> only  in the file system part and our time is limited. My hope was 
>> that  someone would say there would be a basic framework somewhere 
>> where we  can put our code, but due to the time limitations we would 
>> rather do  our own thing. But this is maybe not as bad as it sounds 
>> since (a) our  original plan was very close to what you describe in 
>> certain parts of  the system, (b) we have read your documentation and 
>> (c) our code will  be BSD-licensed.
>>
>> The main ideas we have are:
>> - map file types to document processors
>> - use the java.io.FileFilter interface as base for the mappings
>> - the document processors will probably have a two-method interface:
>>    DocumentSummary processDocument(URL);
>>    String getDisplayName();
>> - the DocSummary class will model the common attributes like author,  
>> title, text, etc. with a Properties object to be extensible. Its 
>> main  purpose is to separate indexing concerns like stored/unstored 
>> and  tokenized/untokenized from the document processors
>> - the display name will be used in the UI to create lists of  
>> FileFilter->DocumentProcessor mappings
>> - there will be some crawler code for the file system, but of course  
>> that is a lot easier
>>
>> Many of these things will not extend straightaway into the web  
>> context, but I think the main work we will do will be in 
>> implementing  the different DocumentProcessors. That part should be 
>> reusable. The  mapping idea should be reusable, although FileFilter 
>> would have to be  replaced with something more abstract, at least a 
>> URLFilter. My  experience with Java networking is not good enough to 
>> judge the  complexity of that.
>>
>> We expect to have the relevant parts of this done next week. Code 
>> will  be on Sourceforge  
>> (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/), it  
>> might be at least useful as inspiration :-) We are also looking into  
>> alternatives for parsing PDF and other formats. We have a lot of  
>> problems with PDFBox at the moment, and there might be other  
>> candidates (http://www.cs.berkeley.edu/~phelps/Multivalent/). And we  
>> are looking into the option to use the UDK for indexing  
>> (http://udk.openoffice.org/), although that most likely will  
>> complicate deployment and increase program size quite a bit. One of  
>> the problems we have is that we have some interesting test cases for  
>> the parsing tools, but we can't give them away and don't have the 
>> time  to debug ourselves. We have a file which causes PDFBox to get 
>> stuck  without any feedback and an XLS file which causes POI to loop 
>> with  funny messages for a long time until we run out of memory 
>> (with  -mx500m). But that is something we have to talk to the other 
>> projects  about.
>>
>> The point of this waffle is: if you think some of our ideas are not 
>> as  good as they should be or there are things that might affect 
>> reuse,  please shout now :-) We start coding this right now.
>>
>>> Unfortunately I have to stop my efforts regarding LARM. Long story  
>>> short: My
>>> future employer says it's too close to their business. But in  
>>> contrast to
>>> other open source projects, there's already lots of ideas in that  
>>> document
>>> and lots of code in the old crawler. If you wish to contribute, 
>>> it's  now up
>>> to you.
>>>
>> Fair enough. I guess as a professional developer you can never be  
>> completely free from considering IP issues.
>>
>> Regards,
>>    Peter
>>
>>
>>> Clemens
>>>
>>>
>>>
>>> ----- Original Message ----- From: "Andrew C. Oliver"  
>>> <ac...@apache.org>
>>> To: "Peter Becker" <pb...@dstc.edu.au>
>>> Cc: "Lucene Developers List" <lu...@jakarta.apache.org>
>>> Sent: Friday, June 27, 2003 2:53 AM
>>> Subject: Re: Lucene crawler plan
>>>
>>>
>>>
>>>> On 6/26/03 8:33 PM, "Peter Becker" <pb...@dstc.edu.au> wrote:
>>>>
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> are you the Andy signing this:
>>>>> http://jakarta.apache.org/lucene/docs/luceneplan.html? If no -- 
>>>>> do  you
>>>>> know who wrote the page and could you forward this email? Thanks.  
>>>>> BTW:
>>>>> your website link on  
>>>>> http://jakarta.apache.org/lucene/docs/whoweare.html
>>>>> is dead.
>>>>>
>>>>>
>>>> Yes I wrote it.
>>>>
>>>>
>>>>> The question is: is there some code already? If yes: can we get 
>>>>> it?  Can
>>>>> we join the effort? If no: what are things we should consider 
>>>>> doing  to
>>>>> increase our chances that you guys accept our code in the end? We  
>>>>> are
>>>>> not really interested in maintaining the crawler bits and pieces,  
>>>>> our
>>>>> interest is in the visualization. We are happy to get something  
>>>>> going as
>>>>> part of our little demonstrator, but then we'd give it to you and  
>>>>> hope
>>>>> someone picks up maintenance.
>>>>>
>>>>>
>>>> I never wrote any code, but there is code in lucene-contrib which  
>>>> realized
>>>> most of what is in this document.  I was going to write code, but  
>>>> someone
>>>> beat me to the punch and I was like "wow I have things I can do that
>>>> others won't do for me" and moved on :-)
>>>>
>>>> I'm cc'ing lucene developers list.  You'll find plenty of folks  
>>>> interested
>>>> in working with you on this.
>>>>
>>>> -Andy
>>>>
>>>>> Is this all an option anyway? It is ok to say no ;-)
>>>>>
>>>>> Regards,
>>>>> Peter
>>>>>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
If you are after a pure file system indexing abstraction, check out the  
'ant' project in the sandbox.  It's got a DocumentHandler abstraction  
allowing it to be a bit pluggable.  It's not perfect, but it has worked  
for me for quite some time quite sufficiently.

	Erik


On Monday, June 30, 2003, at 08:26  PM, Peter Becker wrote:

> Clemens Marschner wrote:
>
>> There's an experimental webcrawler in the lucene-sandbox area called
>> larm-webcrawler (see
>> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),
>>
>> and a project on Sourceforge (http://larm.sf.net) that tries to  
>> leverage
>> this on a higher level. I want to encourage you to go to that site  
>> and read
>> through the specs in Sourceforge's CVS.
>>
> I've done that by now -- my first problem was to identify LARM as the  
> relevant project, but then things were reasonably easy to find.
>
>> It covers pretty much everything that Andy wrote in his proposal,  
>> and
>> more. The project only contains conceptual documents at this time,  
>> but if
>> you're willing to contribute actively, that would be very much 
>> appreciated.
>>
> In many ways the project aims too high for us. We are interested only  
> in the file system part and our time is limited. My hope was that  
> someone would say there would be a basic framework somewhere where we  
> can put our code, but due to the time limitations we would rather do  
> our own thing. But this is maybe not as bad as it sounds since (a) our  
> original plan was very close to what you describe in certain parts of  
> the system, (b) we have read your documentation and (c) our code will  
> be BSD-licensed.
>
> The main ideas we have are:
> - map file types to document processors
> - use the java.io.FileFilter interface as base for the mappings
> - the document processors will probably have a two-method interface:
>    DocumentSummary processDocument(URL);
>    String getDisplayName();
> - the DocSummary class will model the common attributes like author,  
> title, text, etc. with a Properties object to be extensible. Its main  
> purpose is to separate indexing concerns like stored/unstored and  
> tokenized/untokenized from the document processors
> - the display name will be used in the UI to create lists of  
> FileFilter->DocumentProcessor mappings
> - there will be some crawler code for the file system, but of course  
> that is a lot easier
>
> Many of these things will not extend straightaway into the web  
> context, but I think the main work we will do will be in implementing  
> the different DocumentProcessors. That part should be reusable. The  
> mapping idea should be reusable, although FileFilter would have to be  
> replaced with something more abstract, at least a URLFilter. My  
> experience with Java networking is not good enough to judge the  
> complexity of that.
>
> We expect to have the relevant parts of this done next week. Code will  
> be on Sourceforge  
> (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/), it  
> might be at least useful as inspiration :-) We are also looking into  
> alternatives for parsing PDF and other formats. We have a lot of  
> problems with PDFBox at the moment, and there might be other  
> candidates (http://www.cs.berkeley.edu/~phelps/Multivalent/). And we  
> are looking into the option to use the UDK for indexing  
> (http://udk.openoffice.org/), although that most likely will  
> complicate deployment and increase program size quite a bit. One of  
> the problems we have is that we have some interesting test cases for  
> the parsing tools, but we can't give them away and don't have the time  
> to debug ourselves. We have a file which causes PDFBox to get stuck  
> without any feedback and an XLS file which causes POI to loop with  
> funny messages for a long time until we run out of memory (with  
> -mx500m). But that is something we have to talk to the other projects  
> about.
>
> The point of this waffle is: if you think some of our ideas are not as  
> good as they should be or there are things that might affect reuse,  
> please shout now :-) We start coding this right now.
>
>> Unfortunately I have to stop my efforts regarding LARM. Long story  
>> short: My
>> future employer says it's too close to their business. But in  
>> contrast to
>> other open source projects, there's already lots of ideas in that  
>> document
>> and lots of code in the old crawler. If you wish to contribute, it's  
>> now up
>> to you.
>>
> Fair enough. I guess as a professional developer you can never be  
> completely free from considering IP issues.
>
> Regards,
>    Peter
>
>
>> Clemens
>>
>>
>>
>> ----- Original Message ----- From: "Andrew C. Oliver"  
>> <ac...@apache.org>
>> To: "Peter Becker" <pb...@dstc.edu.au>
>> Cc: "Lucene Developers List" <lu...@jakarta.apache.org>
>> Sent: Friday, June 27, 2003 2:53 AM
>> Subject: Re: Lucene crawler plan
>>
>>
>>
>>> On 6/26/03 8:33 PM, "Peter Becker" <pb...@dstc.edu.au> wrote:
>>>
>>>
>>>> Hi Andrew,
>>>>
>>>> are you the Andy signing this:
>>>> http://jakarta.apache.org/lucene/docs/luceneplan.html? If no -- do  
>>>> you
>>>> know who wrote the page and could you forward this email? Thanks.  
>>>> BTW:
>>>> your website link on  
>>>> http://jakarta.apache.org/lucene/docs/whoweare.html
>>>> is dead.
>>>>
>>>>
>>> Yes I wrote it.
>>>
>>>
>>>> The question is: is there some code already? If yes: can we get it?  
>>>> Can
>>>> we join the effort? If no: what are things we should consider doing  
>>>> to
>>>> increase our chances that you guys accept our code in the end? We  
>>>> are
>>>> not really interested in maintaining the crawler bits and pieces,  
>>>> our
>>>> interest is in the visualization. We are happy to get something  
>>>> going as
>>>> part of our little demonstrator, but then we'd give it to you and  
>>>> hope
>>>> someone picks up maintenance.
>>>>
>>>>
>>> I never wrote any code, but there is code in lucene-contrib which  
>>> realized
>>> most of what is in this document.  I was going to write code, but  
>>> someone
>>> beat me to the punch and I was like "wow I have things I can do that
>>> won't do for me" and moved on :-)
>>>
>>> I'm cc'ing lucene developers list.  You'll find plenty of folks  
>>> interested
>>> in working with you on this.
>>>
>>> -Andy
>>>
>>>> Is this all an option anyway? It is ok to say no ;-)
>>>>
>>>> Regards,
>>>> Peter
>>>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene crawler plan

Posted by Peter Becker <pb...@dstc.edu.au>.
Clemens Marschner wrote:

>There's an experimental webcrawler in the lucene-sandbox area called
>larm-webcrawler (see
>http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html),
>
>and a project on Sourceforge (http://larm.sf.net) that tries to leverage
>this on a higher level. I want to encourage you to go to that site and read
>through the specs in Sourceforge's CVS.
>
I've done that by now -- my first problem was to identify LARM as the 
relevant project, but then things were reasonably easy to find.

>It covers pretty much everything that Andy wrote in his proposal, and
>more. The project only contains conceptual documents at this time, but if
>you're willing to contribute actively, that would be very much appreciated.
>
In many ways the project aims too high for us. We are interested only in 
the file system part and our time is limited. My hope was that someone 
would say there would be a basic framework somewhere where we can put 
our code, but due to the time limitations we would rather do our own 
thing. But this is maybe not as bad as it sounds since (a) our original 
plan was very close to what you describe in certain parts of the system, 
(b) we have read your documentation and (c) our code will be BSD-licensed.

The main ideas we have are:
- map file types to document processors
- use the java.io.FileFilter interface as base for the mappings
- the document processors will probably have a two-method interface:
    DocumentSummary processDocument(URL);
    String getDisplayName();
- the DocSummary class will model the common attributes like author, 
title, text, etc. with a Properties object to be extensible. Its main 
purpose is to separate indexing concerns like stored/unstored and 
tokenized/untokenized from the document processors
- the display name will be used in the UI to create lists of 
FileFilter->DocumentProcessor mappings
- there will be some crawler code for the file system, but of course 
that is a lot easier

Many of these things will not extend straightaway into the web context, 
but I think the main work we will do will be in implementing the 
different DocumentProcessors. That part should be reusable. The mapping 
idea should be reusable, although FileFilter would have to be replaced 
with something more abstract, at least a URLFilter. My experience with 
Java networking is not good enough to judge the complexity of that.

We expect to have the relevant parts of this done next week. Code will 
be on Sourceforge 
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/), it 
might be at least useful as inspiration :-) We are also looking into 
alternatives for parsing PDF and other formats. We have a lot of 
problems with PDFBox at the moment, and there might be other candidates 
(http://www.cs.berkeley.edu/~phelps/Multivalent/). And we are looking 
into the option to use the UDK for indexing 
(http://udk.openoffice.org/), although that most likely will complicate 
deployment and increase program size quite a bit. One of the problems we 
have is that we have some interesting test cases for the parsing tools, 
but we can't give them away and don't have the time to debug ourselves. We 
have a file which causes PDFBox to get stuck without any feedback and an 
XLS file which causes POI to loop with funny messages for a long time 
until we run out of memory (with -mx500m). But that is something we have 
to talk to the other projects about.

The point of this waffle is: if you think some of our ideas are not as 
good as they should be or there are things that might affect reuse, 
please shout now :-) We start coding this right now.

>Unfortunately I have to stop my efforts regarding LARM. Long story short: My
>future employer says it's too close to their business. But in contrast to
>other open source projects, there's already lots of ideas in that document
>and lots of code in the old crawler. If you wish to contribute, it's now up
>to you.
>
Fair enough. I guess as a professional developer you can never be 
completely free from considering IP issues.

Regards,
    Peter


>Clemens
>
>
>
>----- Original Message ----- 
>From: "Andrew C. Oliver" <ac...@apache.org>
>To: "Peter Becker" <pb...@dstc.edu.au>
>Cc: "Lucene Developers List" <lu...@jakarta.apache.org>
>Sent: Friday, June 27, 2003 2:53 AM
>Subject: Re: Lucene crawler plan
>
>
>  
>
>>On 6/26/03 8:33 PM, "Peter Becker" <pb...@dstc.edu.au> wrote:
>>
>>    
>>
>>>Hi Andrew,
>>>
>>>are you the Andy signing this:
>>>http://jakarta.apache.org/lucene/docs/luceneplan.html? If no -- do you
>>>know who wrote the page and could you forward this email? Thanks. BTW:
>>>your website link on http://jakarta.apache.org/lucene/docs/whoweare.html
>>>is dead.
>>>
>>>      
>>>
>>Yes I wrote it.
>>
>>    
>>
>>>The question is: is there some code already? If yes: can we get it? Can
>>>we join the effort? If no: what are things we should consider doing to
>>>increase our chances that you guys accept our code in the end? We are
>>>not really interested in maintaining the crawler bits and pieces, our
>>>interest is in the visualization. We are happy to get something going as
>>>part of our little demonstrator, but then we'd give it to you and hope
>>>someone picks up maintenance.
>>>
>>>      
>>>
>>I never wrote any code, but there is code in lucene-contrib which realized
>>most of what is in this document.  I was going to write code, but someone
>>beat me to the punch and I was like "wow I have things I can do that
>>others won't do for me" and moved on :-)
>>
>>I'm cc'ing lucene developers list.  You'll find plenty of folks interested
>>in working with you on this.
>>
>>-Andy
>>    
>>
>>>Is this all an option anyway? It is ok to say no ;-)
>>>
>>>Regards,
>>> Peter
>>>      
>>>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org