Posted to dev@lucene.apache.org by Simon Willnauer <si...@googlemail.com> on 2006/07/19 23:52:22 UTC

GData - Server, Indexing entries

Hello everyone,

Well, the last mail about distributed indexing / searching did not
receive many answers, maybe because the topic is a tough one. Anyway,
I will try to kick off the indexing / searching milestone with another
mail.
The GData server has to index all incoming entries on inserts or
updates and mark already indexed entries as deleted on delete
requests. The incoming data will be XML in the first place. Which XML
elements are indexed, and how, will be defined in the server
configuration. I guess it would be quite handy to configure which
elements to index using XPath expressions. That's fairly generic, and
most developers and admins are more or less familiar with XPath. The
analyzer etc. will also come from the configuration file.
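
To make the XPath idea concrete, here is a minimal sketch using only the
JDK's javax.xml.xpath API. The class name XPathFieldExtractor, the field
names and the sample entry are assumptions for illustration and are not
part of the GData server. The sample entry deliberately has no namespace;
real Atom entries are namespaced, so the expressions would then need a
NamespaceContext.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: the configuration is modelled as a map from
// index field name to the XPath expression that selects its content.
final class XPathFieldExtractor {

    /** Evaluates every configured XPath against the entry; returns field name -> text. */
    static Map<String, String> extract(String entryXml, Map<String, String> fieldToXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(entryXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> mapping : fieldToXPath.entrySet()) {
                // evaluate(String, Object) yields the string value of the first
                // matching node, or an empty string when nothing matches.
                fields.put(mapping.getKey(), xpath.evaluate(mapping.getValue(), doc));
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not extract fields from entry", e);
        }
    }
}
```

With a configuration of title -> /entry/title and content ->
/entry/content, each returned value could then be added to the index as
one field.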
The next step is to retrieve the data from within the elements.
Elements have three types of content relevant for indexing: plain text,
HTML and XHTML (binary content might be tough to index :)
I have to remove the tags from the HTML and XHTML content. I'm aware
that there are several APIs around for doing that, but it would be
quite helpful to have some recommendations.
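
As a rough illustration of handling the three content types, the sketch
below dispatches on the type attribute. It is not a recommendation of a
particular API: the class ContentTextExtractor is hypothetical, the HTML
branch is a naive regex that a real HTML parser should replace, and the
XHTML branch assumes the content is a single well-formed element without
DTD-defined entities such as &nbsp;.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper that turns element content into plain text for indexing.
final class ContentTextExtractor {

    static String toIndexableText(String type, String content) {
        switch (type) {
            case "text":  return content;
            case "html":  return collapseWhitespace(stripTagsNaively(content));
            case "xhtml": return collapseWhitespace(textOfWellFormedMarkup(content));
            default: throw new IllegalArgumentException("unsupported content type: " + type);
        }
    }

    // Naive: replaces anything that looks like a tag with a space and decodes
    // a handful of entities. Scripts, comments and broken markup are not handled,
    // and a word split by an inline tag ("wor<b>ld</b>") becomes two tokens.
    static String stripTagsNaively(String html) {
        return html.replaceAll("<[^>]*>", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ")
                .replace("&amp;", "&"); // last, so "&amp;lt;" is not decoded twice
    }

    // Parses the markup as XML and returns the concatenated text nodes.
    static String textOfWellFormedMarkup(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Incoming data is untrusted, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("content is not well-formed XHTML", e);
        }
    }

    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

For example, toIndexableText("html", "<p>Hello <b>world</b> &amp; more</p>")
returns "Hello world & more".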

GData defines a kind of query "language" for querying a specific
feed via GET parameters and / or defined endings of the query string.
(http://code.google.com/apis/gdata/protocol.html#Queries)
I do have some experience with building parsers (not JavaCC, but yacc /
gentle), so I will try to parse the so-called "GData query" and
translate it into a Lucene query string. Using JavaCC, I can build a
fast and clean way to create Lucene queries from incoming "GData
queries".
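
To show the kind of translation meant here, the following is a
hand-written sketch, not a JavaCC grammar. It reflects my reading of the
protocol document: terms in q are ANDed, a quoted phrase is matched
exactly, a leading "-" excludes a term, and "|" inside a category filter
means OR. The class name and the index field names content, author and
category are assumptions; the output uses standard Lucene query parser
syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical translator from GData query parts to a Lucene query string.
final class GDataQueryTranslator {

    // A token is an optionally negated quoted phrase, or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("-?\"[^\"]*\"|\\S+");

    static String toLuceneQuery(List<String> categoryFilters, Map<String, String> params) {
        List<String> clauses = new ArrayList<>();

        // Each filter is required; "a|b" inside a filter means a OR b.
        for (String filter : categoryFilters) {
            List<String> alternatives = new ArrayList<>();
            for (String category : filter.split("\\|")) {
                alternatives.add("category:" + quote(category));
            }
            clauses.add("+(" + String.join(" ", alternatives) + ")");
        }

        // Every q term is required unless it is prefixed with "-".
        String q = params.get("q");
        if (q != null) {
            Matcher matcher = TOKEN.matcher(q);
            while (matcher.find()) {
                String token = matcher.group();
                boolean negated = token.length() > 1 && token.startsWith("-");
                if (negated) {
                    token = token.substring(1);
                }
                if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                    token = token.substring(1, token.length() - 1);
                }
                if (!token.isEmpty()) {
                    clauses.add((negated ? "-" : "+") + "content:" + quote(token));
                }
            }
        }

        String author = params.get("author");
        if (author != null && !author.isEmpty()) {
            clauses.add("+author:" + quote(author));
        }
        return String.join(" ", clauses);
    }

    // Wrapping every value in quotes avoids escaping Lucene's other special characters.
    private static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

For the category filter Fritz|Laurie with q=jo -"green tea" and
author=Elizabeth this produces:
+(category:"Fritz" category:"Laurie") +content:"jo" -content:"green tea" +author:"Elizabeth"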

I do have lots of ideas to extend the search capabilities described in
the GData protocol, but I guess I will postpone that until after SoC
has finished.

I just wanna ask you guys to let me know if you have some ideas about all that.
Every comment will be highly appreciated!!!!

regards Simon

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: GData - Server, Indexing entries

Posted by Simon Willnauer <si...@googlemail.com>.
Hi Otis,

On 7/20/06, Otis Gospodnetic <ot...@yahoo.com> wrote:
> Hi Simon,
>
> I have to admit that I'm not sure what exactly you are asking here. :)
Well, that is what I ask myself sometimes. The GData server is
unfortunately a bit different from the rest of the Lucene work. I can
only discuss what I want to do and how I want to do it, so this does
not always look like a question. I don't really have any questions
about the indexing component at the moment, but I would love to get
some ideas from you guys if there are any. I will just let you know
what I'm doing, not in detail but as an overview. If you are interested
in it, you (I don't mean Otis in particular) can join the discussion.
But over time I have found that it is quite hard to start those
discussions. I think this is not because the Lucene developers are
not interested; it's rather due to the fact that the first part of the
project is much more about building the server itself. Now there will
be much more Lucene stuff to do, but it will not be very exciting, just
"ordinary" indexing of XML / HTML and text data.
It might get more interesting after Summer of Code has finished. There
will be lots of features to add: similarity queries, distributed
searching / indexing and so on.

> But I do have a comment/question about this:
>
> > The next step is to retrieve the data from within the elements.
> > Elements have three types of content relevant for indexing: plain text,
> > HTML and XHTML (binary content might be tough to index :)
>
> I looked at http://code.google.com/apis/gdata/protocol.html#Inserting-a-new-entry
> Examples in that section contain elements like these:
>   <title type="text">Entry 1</title>
>   <content type="text">This is my entry</content>
> Is that type="text" what you are referring to above, and are you saying this could also be type="html" and type="xhtml", and that the actual content between those container tags could be (X)HTML?  Is that described somewhere in the protocol as supported?
> What if you wanted some other format, say some XML?

Yes, sure, that will be possible, but indexing it would be the same as
for XHTML, wouldn't it?!
If you want to index the elements as separate fields, you can already
use the XPath configuration, and the data will be plain text again.

I would like to have some suggestions on which API I should use to
extract the text from the HTML. Does the JavaCC one in the demo do the
job? Well, I will check whether it's enough for my purpose. That one is
Apache licensed already ;)

regards simon

>
> Just curious.
>
> Thanks,
> Otis


Re: GData - Server, Indexing entries

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Simon,

I have to admit that I'm not sure what exactly you are asking here. :)
But I do have a comment/question about this:

> The next step is to retrieve the data from within the elements.
> Elements have three types of content relevant for indexing: plain text,
> HTML and XHTML (binary content might be tough to index :)

I looked at http://code.google.com/apis/gdata/protocol.html#Inserting-a-new-entry
Examples in that section contain elements like these:
  <title type="text">Entry 1</title>
  <content type="text">This is my entry</content>
Is that type="text" what you are referring to above, and are you saying this could also be type="html" and type="xhtml", and that the actual content between those container tags could be (X)HTML?  Is that described somewhere in the protocol as supported?
What if you wanted some other format, say some XML?

Just curious.

Thanks,
Otis

