
Posted to java-user@lucene.apache.org by Ilya Zavorin <iz...@caci.com> on 2011/12/06 23:41:00 UTC

tokenizing text using language analyzer but preserving stopwords if possible

I need to implement a "quick and dirty" or "poor man's" translation of a foreign language document by looking up each word in a dictionary and replacing it with the English translation. So what I need is to tokenize the original foreign text into words and then access each word, look it up and get its translation. However, if possible, I also need to preserve "non-words", i.e. stopwords so that I could replicate them in the output stream without translating. If the latter is not possible then I just need to preserve the order of the original words so that their translations have the same order in the output.

Can I accomplish this using Lucene components? I presume I'd have to start by creating an analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access words in the correct order, (iii) also access non-words if possible?

Thanks much


Ilya Zavorin

Re: tokenizing text using language analyzer but preserving stopwords if possible

Posted by KARTHIK SHIVAKUMAR <ns...@gmail.com>.

Hi

>> tokenize the original foreign text into words

You need to identify the appropriate analyzer for the foreign language
before indexing.


with regards
karthik


-- 
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*

Re: Improving Lucene Search Performance

Posted by Ian Lea <ia...@gmail.com>.

See http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.  Some of
the tips relate to indexing but most to search time stuff.


--
Ian.


On Thu, Dec 8, 2011 at 10:45 AM, Dilshad K. P. <di...@nestgroup.net> wrote:
> Hi,
> Is there anything to take care of while creating an index to improve Lucene text search speed?
>
> Thanks And Regards
> Dilshad K.P
> ***** Confidentiality Statement/Disclaimer *****
>
> This message and any attachments is intended for the sole use of the intended recipient. It may contain confidential information. Any unauthorized use, dissemination or modification is strictly prohibited. If you are not the intended recipient, please notify the sender immediately then delete it from all your systems, and do not copy, use or print. Internet communications are not secure and it is the responsibility of the recipient to make sure that it is virus/malicious code exempt.
> The company/sender cannot be responsible for any unauthorized alterations or modifications made to the contents. If you require any form of confirmation of the contents, please contact the company/sender. The company/sender is not liable for any errors or omissions in the content of this message.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Improving Lucene Search Performance

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: Improving Lucene Search Performance
: In-Reply-To:
:     <CA...@mail.gmail.com>
: References:
:     <16...@ebi.ac.uk><CAFVhWXieRFqstbGPi+wM1zhZ
:     LL0SMr0uz8+7CUhsHPYdUWQpQA@mail.gmail.com><347A161B-6C7B-4DC3-ACD0-9A804E2
:     DD36C@ebi.ac.uk><CABYvkPR3_14cTaorH-hQ+uYMvvRBMQx5GWzuNAYmE+PYp=fLsg@mail.
:     gmail.com><00...@ebi.ac.uk><A57498EDEC10C64
:     781EA0F7DBA665CEF019DE1@ex2010mb01-2.caci.com>
:  <CA...@mail.gmail.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, so your question is "hidden" in that thread and gets less
attention. It also makes following discussions in the mailing list archives
particularly difficult.



-Hoss


Improving Lucene Search Performance

Posted by "Dilshad K. P." <di...@nestgroup.net>.

Hi,
Is there anything to take care of while creating an index to improve Lucene text search speed?

Thanks And Regards
Dilshad K.P

Re: tokenizing text using language analyzer but preserving stopwords if possible

Posted by Avi Rosenschein <ar...@gmail.com>.

On Wed, Dec 7, 2011 at 00:41, Ilya Zavorin <iz...@caci.com> wrote:

> I need to implement a "quick and dirty" or "poor man's" translation of a
> foreign language document by looking up each word in a dictionary and
> replacing it with the English translation. So what I need is to tokenize
> the original foreign text into words and then access each word, look it up
> and get its translation. However, if possible, I also need to preserve
> "non-words", i.e. stopwords so that I could replicate them in the output
> stream without translating. If the latter is not possible then I just need
> to preserve the order of the original words so that their translations have
> the same order in the output.
>
> Can I accomplish this using Lucene components? I presume I'd have to start
> by creating an analyzer for the foreign language, but then what? How do I
> (i) tokenize, (ii) access words in the correct order, (iii) also access
> non-words if possible?
>

You can always use something like StandardAnalyzer for the specific
language, with an empty stopword list (so that no words are treated as
stopwords). A bit trickier might be dealing with punctuation - depending on
the analyzer, you might be able to get these to parse as separate tokens.
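
To make the idea concrete, here is a minimal sketch of the overall loop:
split the text in order, translate what is in the dictionary, and copy
everything else (stopwords, punctuation, whitespace) through unchanged. Note
the assumptions: it uses the JDK's java.text.BreakIterator as a stand-in for
a Lucene analyzer so that it runs without any Lucene jars, and the class name
PoorMansTranslator, the translate method and the tiny Spanish dictionary are
all made up for illustration. With Lucene you would presumably run the same
kind of loop over the analyzer's TokenStream instead.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: word-by-word dictionary lookup
// that keeps the original order and passes non-words through untouched.
class PoorMansTranslator {

    static String translate(String text, Locale locale, Map<String, String> dictionary) {
        // A word BreakIterator reports every boundary, so the segments between
        // boundaries cover the whole input: words, punctuation and whitespace.
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                // A word: look up its lowercased form. Anything the dictionary
                // does not know (e.g. stopwords) is copied as-is.
                String translation = dictionary.get(segment.toLowerCase(locale));
                out.append(translation != null ? translation : segment);
            } else {
                // A non-word (punctuation, whitespace): copy verbatim.
                out.append(segment);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data, for illustration only.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("gato", "cat");
        dictionary.put("negro", "black");
        System.out.println(translate("El gato negro.", Locale.forLanguageTag("es"), dictionary));
    }
}
```

Because untranslated segments are copied straight from the input, nothing is
dropped and the output order always matches the source. The lookup is
case-insensitive, so the capitalization of a translated word is lost; that is
a simplification of this sketch, not a requirement of the approach.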

-- Avi

