Posted to java-user@lucene.apache.org by Daniel Cortes <dc...@fib.upc.edu> on 2004/12/21 16:39:55 UTC

Lucene working with a DB

I have read in a lot of messages that Lucene can index a DB because it uses 
that INPUTSTREAM "type".
I don't understand how to do this. For example, if I have a forum backed by 
MySQL and a lot of files on my web site, do I have to select, for every 
search, the index that I want to use? Also, I don't know how to make 
Lucene write an index of the information in the forum's DB 
(for example, MySQL).


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Lucene working with a DB

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 21, 2004, at 10:39 AM, Daniel Cortes wrote:
> I have read in a lot of messages that Lucene can index a DB because it 
> uses that INPUTSTREAM "type".

Where have you read that?  This is incorrect.

> I don't understand how to do this. For example, if I have a forum backed 
> by MySQL and a lot of files on my web site, do I have to select, for 
> every search, the index that I want to use? Also, I don't know how to 
> make Lucene write an index of the information in the forum's DB 
> (for example, MySQL).

To index data in a database into a Lucene index, you must write code 
that pulls the records from the database and adds them to a Lucene 
index, slicing into fields in whatever manner you need.  You will want 
to be sure to update the index when your database changes by either 
removing, or "updating" (remove and re-add) documents.  There is 
nothing built-in that will do these steps for you.

	Erik


Re: (Offtopic) The unicode name for a character

Posted by Chris Hostetter <ho...@fucit.org>.

: However, I don't think that the names are consistent enough to permit a
: generic use of regular expressions. What Daniel is trying to achieve
: looks interesting anyway,

I'm not sure that that really matters in the long run ... I think the OP
was asking if there was a way to get the name in Java because he figured
that way he could programmatically determine what the "base" character was in
his application.  But that doesn't mean he needs to do this
programmatically every time his indexing/searching code sees a character
outside of Latin-1.

It would probably make more sense to write a little one-off program that
could read in this file, and then spit out all of the non-Latin-1
characters with a guess as to which Latin-1 character could act as a
substitution (if any) based on the name of the character, and a blank for
the user to override.  This program could be run once to generate a nice
small, efficient mapping table that could be (committed to CVS and) reused
over and over.

-Hoss
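
The one-off program suggested above could look roughly like the following.
This is only a sketch: the class name, the method name, and the
"CODE;GUESS;OVERRIDE" output layout are made up for illustration, and the
guess relies on the regular expression from the original question, so any
name that does not fit that pattern simply gets an empty guess.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-off generator; all names here are illustrative only.
class LatinMappingTable {

    // The pattern proposed in the original question.
    private static final Pattern BASE_LETTER =
            Pattern.compile("^LATIN .* LETTER (.) WITH .*$");

    // Turns one UnicodeData.txt line into "CODE;GUESS;OVERRIDE", or returns
    // null for characters that are already inside Latin-1 (<= U+00FF).
    static String tableRow(String unicodeDataLine) {
        String[] fields = unicodeDataLine.split(";", -1);
        int codePoint = Integer.parseInt(fields[0], 16);
        if (codePoint <= 0xFF) {
            return null;
        }
        String name = fields[1];
        String guess = "";
        Matcher m = BASE_LETTER.matcher(name);
        if (m.matches()) {
            guess = m.group(1);
            // Character names are all upper case; restore the case of the letter.
            if (name.contains(" SMALL ")) {
                guess = guess.toLowerCase(Locale.ROOT);
            }
        }
        // The last column is left blank for a human to override the guess.
        return fields[0] + ";" + guess + ";";
    }

    // Usage: java LatinMappingTable /path/to/UnicodeData.txt > mapping.txt
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            if (line.isEmpty()) {
                continue;
            }
            String row = tableRow(line);
            if (row != null) {
                System.out.println(row);
            }
        }
    }
}
```

The generated file would then be reviewed by hand once, and the
indexing/searching code would only ever load the finished table.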



Re: (Offtopic) The unicode name for a character

Posted by Pierrick Brihaye <pi...@culture.gouv.fr>.

Hi,

Morus Walter wrote:

> If you cannot find that list somewhere I can mail you a copy.

ICU4J's copy is here:

http://oss.software.ibm.com/cvs/icu4j/icu4j/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt?rev=1.7&content-type=text/x-cvsweb-markup

See also Unicode's own copy:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

http://pistos.pe.kr/javadocs/etc/icu4j2_4/doc/com/ibm/icu/lang/UCharacter.html#getName(int) 
should also help you.

However, I don't think that the names are consistent enough to permit a 
generic use of regular expressions. What Daniel is trying to achieve 
looks interesting anyway.

Good luck,

-- 
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:pierrick.brihaye@culture.gouv.fr
+33 (0)2 99 29 67 78


Re: (Offtopic) The unicode name for a character

Posted by Morus Walter <mo...@tanto.de>.

Hi Peter,
> 
> The Question:
> In Java generally, Is there an easy way to get the unicode name of a 
> character?  (e.g. "LATIN SMALL LETTER A" from 'a')
> 
...
> 
> I'm considering taking the unicode name for each character I encounter 
> and regexping it against something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try and extract the single A-Z|a-z character.
> 
There used to be a list (ASCII) on some ftp server at unicode.org.
I have a version 'UnicodeData.txt' here.
It lists ~ 12000 characters in the form
01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;;;;;N;LATIN CAPITAL LETTER P HOOK;;;01A5;
01A5;LATIN SMALL LETTER P WITH HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER P HOOK;;01A4;;01A4
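
For reference, here is a minimal sketch of reading one such line in Java.
The class and member names are invented for the example; the field positions
(0 = code point in hex, 1 = character name, 5 = decomposition mapping)
follow the published UnicodeData.txt layout. The two lines above have an
empty field 5, which is why those characters do not decompose.

```java
// Hypothetical holder for the three fields that matter in this thread.
class UnicodeDataRecord {
    final int codePoint;
    final String name;
    final String decomposition; // empty when the character does not decompose

    UnicodeDataRecord(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] f = line.split(";", -1);
        codePoint = Integer.parseInt(f[0], 16);
        name = f[1];
        decomposition = f[5];
    }

    boolean decomposes() {
        return !decomposition.isEmpty();
    }

    public static void main(String[] args) {
        UnicodeDataRecord hook = new UnicodeDataRecord(
            "01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;;;;;N;LATIN CAPITAL LETTER P HOOK;;;01A5;");
        UnicodeDataRecord acute = new UnicodeDataRecord(
            "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9");
        System.out.println(hook.name + " decomposes: " + hook.decomposes());
        System.out.println(acute.name + " decomposes: " + acute.decomposes());
    }
}
```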

If you cannot find that list somewhere I can mail you a copy.

It would be a nice contribution if you could add your filter to Lucene's
sandbox once it's finished.

Morus


Re: (Offtopic) The unicode name for a character

Posted by Otis Gospodnetic <ot...@yahoo.com>.

If you are not tied to Java, see 'unac' at http://www.senga.org/.
It's old, but if nothing else you could see how it works and rewrite it
in Java.  And if you can, you can donate it to Lucene Sandbox.

Otis

--- Peter Pimley <pp...@semantico.com> wrote:

> 
> Hi everyone,
> 
> The Question:
> In Java generally, Is there an easy way to get the unicode name of a 
> character?  (e.g. "LATIN SMALL LETTER A" from 'a')
> 
> 
> The Reasoning (for those who are interested):
> The documents I'm indexing have quite a lot of characters that are 
> basically variations on the basic A-Z ones.  In my analysis step, I'd 
> like to convert these to their closest equivalent in the basic A-Z set.
> 
> For some letters, this is easy.  An example is the e-acute character 
> (00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into 
> plain 'e'.  I can do that by using the IBM ICU4J tools to decompose the 
> single character into two; 'e' and 0301 COMBINING ACUTE ACCENT.  Then I 
> can strip all characters that fail Character.isLetterOrDigit.  That 
> works fine.
> 
> Some characters however do not decompose.  An example is the character 
> 01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with 
> 'P', but it does not decompose into P + something.
> 
> I'm considering taking the unicode name for each character I encounter 
> and regexping it against something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try and extract the single A-Z|a-z character.

(Offtopic) The unicode name for a character

Posted by Peter Pimley <pp...@semantico.com>.

Hi everyone,

The Question:
In Java generally, is there an easy way to get the Unicode name of a 
character?  (e.g. "LATIN SMALL LETTER A" from 'a')


The Reasoning (for those who are interested):
The documents I'm indexing have quite a lot of characters that are 
basically variations on the basic A-Z ones.  In my analysis step, I'd 
like to convert these to their closest equivalent in the basic A-Z set.

For some letters, this is easy.  An example is the e-acute character 
(00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into 
plain 'e'.  I can do that by using the IBM ICU4J tools to decompose the 
single character into two; 'e' and 0301 COMBINING ACUTE ACCENT.  Then I 
can strip all characters that fail Character.isLetterOrDigit.  That 
works fine.
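
The decompose-then-strip step described above can be sketched without ICU4J
by swapping in java.text.Normalizer, which the JDK has shipped since Java 6
(so it was not available when this thread was written). The class and method
names are made up for the example. Note that, exactly as described, anything
failing Character.isLetterOrDigit is dropped, which includes spaces and
punctuation, so this is meant to run on single tokens.

```java
import java.text.Normalizer;

// Illustrative helper; java.text.Normalizer stands in for the ICU4J tools.
class AccentStripper {

    static String stripAccents(String token) {
        // NFD splits U+00E9 into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            // Combining marks are neither letters nor digits, so they are dropped.
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00E9"));
        // U+01A4 has no decomposition, so it passes through unchanged.
        System.out.println(stripAccents("\u01A4").equals("\u01A4"));
    }
}
```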

Some characters however do not decompose.  An example is the character 
01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with 
'P', but it does not decompose into P + something.

I'm considering taking the unicode name for each character I encounter 
and regexping it against something like:
^LATIN .* LETTER (.) WITH .*$
... to try and extract the single A-Z|a-z character.
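
As a rough illustration of this idea: Java 7 and later (released well after
this thread) provide Character.getName(int), which returns the Unicode name
directly. The sketch below combines it with the pattern above. The class and
method names are hypothetical, and names that do not fit the pattern yield
null, so a fallback is still needed.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; requires Java 7+ for Character.getName(int).
class BaseLetterGuess {

    private static final Pattern BASE_LETTER =
            Pattern.compile("^LATIN .* LETTER (.) WITH .*$");

    // Returns the guessed A-Z / a-z base letter, or null if there is no match.
    static String baseLetter(int codePoint) {
        String name = Character.getName(codePoint); // null if unassigned
        if (name == null) {
            return null;
        }
        Matcher m = BASE_LETTER.matcher(name);
        if (!m.matches()) {
            return null;
        }
        String letter = m.group(1);
        // Names are upper case, so restore the case of the source character.
        return Character.isLowerCase(codePoint)
                ? letter.toLowerCase(Locale.ROOT)
                : letter;
    }

    public static void main(String[] args) {
        System.out.println(Character.getName('a')); // LATIN SMALL LETTER A
        System.out.println(baseLetter(0x01A4));     // P
    }
}
```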



Re: Lucene working with a DB

Posted by "amigo@max3d.com" <am...@max3d.com>.

Hello

I'll just paste the relevant MySQL code; add the calls to it per 
your needs. It does no checking of anything, so better add that as well.
It's possible I didn't copy/paste everything, but you should get the idea 
of where this is going...

-pedja


------------------------------

import java.sql.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class sqlTest {

  public static void main(String[] args) throws Exception {

    String sTable   = args[0];
    String sThing   = args[1];
    String indexDir = "/path/to/lucene/index";

    try {
      Analyzer analyzer    = new StandardAnalyzer();
      // third argument false: add to an existing index instead of creating one
      IndexWriter fsWriter = new IndexWriter(indexDir, analyzer, false);
      addSQLDoc(fsWriter, sTable, sThing);
      fsWriter.close();
    } catch (Exception e) {
      throw new Exception(" caught a " + e.getClass()
          + "\n with message: " + e.getMessage());
    }
  }

  // static, so that main() can call it
  private static void addSQLDoc(IndexWriter writer, String sqlTable,
                                String somethingElse) throws Exception {

    String cs  = "jdbc:mysql://HOST/DATABASE?user=SQLUSER&password=SQLPASSWORD";
    // the table name cannot be a bind parameter, so it must come from a
    // trusted source; the value is bound with '?' to avoid quoting problems
    String sql = "SELECT * FROM " + sqlTable + " WHERE something = ?";

    // load the MySQL JDBC driver
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
    } catch (Exception e) {
      System.out.println("Lucene: ERROR: Unable to load driver");
      e.printStackTrace();
    }

    // get the record data...
    try {
      Connection conn        = DriverManager.getConnection(cs);
      PreparedStatement stmt = conn.prepareStatement(sql);
      stmt.setString(1, somethingElse);
      ResultSet rs           = stmt.executeQuery();

      while (rs.next()) {
        // make a new, empty document
        Document doc = new Document();

        // get the database fields
        String field1 = rs.getString(1);
        String field2 = rs.getString(2);
        String field3 = rs.getString(3);
        String field4 = rs.getString(4);
        String field5 = rs.getString(5);

        // Keyword fields are stored and indexed but not tokenized;
        // the Text field is tokenized by the analyzer
        doc.add(Field.Keyword("FIELD1", field1));
        doc.add(Field.Keyword("FIELD2", field2));
        doc.add(Field.Keyword("FIELD3", field3));
        doc.add(Field.Keyword("FIELD4", field4));
        doc.add(Field.Text("FIELD5", field5));

        // add the document
        writer.addDocument(doc);
      } // close while(..)

      rs.close();
      stmt.close();
      conn.close();

    } catch (SQLException e) {
      e.printStackTrace();
      throw new Exception("SQL error: " + e.getMessage());
    }
  }
}

--------------------------------------------------------------


Daniel Cortes said the following on 12/21/2004 10:39 AM:

> I have read in a lot of messages that Lucene can index a DB because it 
> uses that INPUTSTREAM "type".
> I don't understand how to do this. For example, if I have a forum backed 
> by MySQL and a lot of files on my web site, do I have to select, for 
> every search, the index that I want to use? Also, I don't know how to 
> make Lucene write an index of the information in the forum's DB 
> (for example, MySQL).