Posted to general@lucene.apache.org by Bastiaan Braams <bj...@gmail.com> on 2013/08/12 13:24:38 UTC

Getting started, wish to present a bibliographical database to the web

Greetings. I am a newcomer looking for advice about getting started with
Lucene Core and/or Solr in order to present to the world a searchable
bibliographical database.

I have the database in my filespace in a plain text format; let us say as a
BibTeX file. So the data is quite well structured, with fields such as
Author, Title, Journal and Year, but also some less structured fields:
Abstract, Notes, Keywords. I don't have the article full texts.

There are about 100 000 entries in the database; the total size is less
than 1 GB.

I have access to a server that already provides web pages to the world. Now
I want to provide these bibliographical data to the world, with some search
functionality for the visitors.

Would Lucene Core be a good building block for this? Would I have any use
for Lucene Solr? I have the impression that I should consider Solr only if
the data were distributed over the web, but in my case the data are all in
one place that is under my control.

The quick tutorial for Lucene Core shows how I may create a Lucene database
and query it on my system through the command line. Could someone please
recommend a tutorial about creating a web interface for the prospective
world-wide users of this database?

Re: Getting started, wish to present a bibliographical database to the web

Posted by Ted Dunning <te...@gmail.com>.
Solr will be the easier integration for you. Solr can crawl your
bibliographic database, but offhand I don't see a way for Solr (or Lucene,
for that matter) to parse BibTeX directly.

One kind of fun way around that would be to munge your website to have the
option of presenting your bibliography in a simple format such as JSON.
The JSON-based BibTeX parsers at

http://fisica.cab.cnea.gov.ar/colisiones/staff/fiol/bibtexparse_es.html

or

https://code.google.com/p/bibtex-js/

would make this fairly easy. The former would be easy to integrate on the
server side and the latter would be easy to integrate on the browser side.

Then all you would need to do is point Solr at a page that lists all of
the files to parse. A directory tree is a natural way to do this.
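
As a rough illustration of the BibTeX-to-JSON step, here is a minimal Python sketch. It is not either of the parsers linked above. It assumes simple entries whose values are delimited by one pair of braces or quotes with no nested braces, and the function name bibtex_to_docs, the "id" and "entry_type" keys, and the sample entries are all made up for the example. A real bibliography with nested braces, @string macros or unusual formatting would need a proper parser.

```python
import json
import re

# Matches the start of an entry, e.g. "@article{doe2001,".
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches 'name = {value}' or 'name = "value"' (no nested braces).
FIELD = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")')


def bibtex_to_docs(text):
    """Convert simple BibTeX text into a list of flat dictionaries."""
    docs = []
    starts = list(ENTRY_START.finditer(text))
    for i, match in enumerate(starts):
        # The body of an entry runs up to the start of the next entry.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = text[match.end():end]
        doc = {"id": match.group(2), "entry_type": match.group(1).lower()}
        for name, braced, quoted in FIELD.findall(body):
            value = braced if braced else quoted
            # Collapse line breaks and repeated spaces inside a value.
            doc[name.lower()] = " ".join(value.split())
        docs.append(doc)
    return docs


# Invented sample entries, for illustration only.
SAMPLE = """
@article{doe2001,
  author = {Doe, Jane and Roe, Richard},
  title = {An Example Title},
  journal = "Journal of Examples",
  year = {2001},
  keywords = {sample, test}
}
@book{roe1999,
  author = {Roe, Richard},
  title = {Another Example},
  year = {1999}
}
"""

if __name__ == "__main__":
    print(json.dumps(bibtex_to_docs(SAMPLE), indent=2))
```

Each dictionary is flat, with one key per BibTeX field, which is a shape that can be serialized directly to JSON.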




Re: Getting started, wish to present a bibliographical database to the web

Posted by Bastiaan Braams <bj...@gmail.com>.
Thank you Mark Bennett and Ted Dunning: (a) for the advice to use Solr
rather than Lucene Core and (b) for the advice to use JSON or maybe XML or
CSV. I can transform my data files to JSON format quite easily. With
respect to Solr, indeed I was confused by all the references to its role as
a cloud platform; I had not recognized it as a tool to work with a simple
database that is stored on one's own server. -Bas Braams



Re: Getting started, wish to present a bibliographical database to the web

Posted by Mark Bennett <ma...@lucidworks.com>.
Hello Bastiaan,

On Aug 12, 2013, at 4:24 AM, Bastiaan Braams <bj...@gmail.com> wrote:

> Greetings. I am a newcomer looking for advice about getting started with
> Lucene Core and/or Solr in order to present to the world a searchable
> bibliographical database.

Excellent.

> I have the database in my filespace in a plain text format; let us say as a
> BibTeX file. So the data is quite well structured, with fields such as
> Author, Title, Journal and Year, but also some less structured fields:
> Abstract, Notes, Keywords. I don't have the article full texts.

The trick will be to get this data into one of the formats that Solr can digest (XML, JSON or CSV), or to write a Java client using SolrJ that reads the file and submits it.
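
For the JSON route, a minimal Python sketch of the submission step might look like the following. It only builds the HTTP request and leaves sending to the caller. The host, port, core name "bibliography" and the helper names are assumptions for illustration, and the exact path of the update handler depends on the Solr version and configuration, so check it against your installation.

```python
import json
import urllib.request


def build_update_request(docs, solr_url="http://localhost:8983/solr",
                         core="bibliography"):
    """Build (but do not send) a request that adds docs to a Solr core."""
    # Assumed URL layout: <solr_url>/<core>/update. commit=true asks Solr
    # to make the documents searchable once the request completes.
    url = f"{solr_url}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chunks(items, size):
    """Yield successive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With about 100 000 entries it is probably more comfortable to send the documents in batches, for example by looping over chunks(docs, 1000) and calling urllib.request.urlopen(build_update_request(batch)) for each batch.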

> There are about 100 000 entries in the database; the total size is less
> than 1 GB.

That's fine; that's a reasonable amount of data.

> I have access to a server that already provides web pages to the world. Now
> I want to provide these bibliographical data to the world, with some search
> functionality for the visitors.

Good.

> Would Lucene Core be a good building block for this? Would I have any use
> for Lucene Solr?

I would strongly suggest Solr over Lucene.

> I have the impression that I should consider Solr only if
> the data were distributed over the web, ...

This is not correct, although I'm curious how you got that impression. The "cloud" in SolrCloud refers to Solr itself being able to run on multiple machines for larger datasets, although I think other people are sometimes confused about what the "cloud" really means.

> but in my case the data are all in
> one place that is under my control.

Solr can run on one machine; that's fine.

> 
> The quick tutorial for Lucene Core shows how I may create a Lucene database
> and query it on my system through the command line. Could someone please
> recommend a tutorial about creating a web interface for the prospective
> world-wide users of this database?

You really want Solr for this.

You can customize the Solr interface with Velocity templates. Here's an article that discusses several options:

http://searchhub.org/2010/01/14/solr-search-user-interface-examples/
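
If you would rather have your own web pages talk to Solr than use the Velocity templates, the server-side part can be quite small. The Python sketch below builds a query URL for Solr's select handler and pulls the matching documents out of a JSON response. The host, port, core name "bibliography" and the function names are assumptions for illustration, not settings of any particular installation.

```python
import json
import urllib.parse


def build_select_url(user_query, solr_url="http://localhost:8983/solr",
                     core="bibliography", rows=10, start=0):
    """Return a select-handler URL that asks for results as JSON."""
    params = {"q": user_query, "wt": "json", "rows": rows, "start": start}
    # urlencode escapes quotes, colons and spaces in the query string.
    return f"{solr_url}/{core}/select?" + urllib.parse.urlencode(params)


def extract_docs(response_text):
    """Return the list of matching documents from a JSON select response."""
    return json.loads(response_text)["response"]["docs"]
```

A page handler would call build_select_url with the visitor's search terms, fetch the URL with urllib.request.urlopen, and pass the response body to extract_docs before rendering the results. The rows and start parameters give simple paging.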

Welcome on board!