Posted to solr-user@lucene.apache.org by Sandeep B A <be...@gmail.com> on 2014/09/05 16:18:26 UTC

Is there any sentence tokenizers in sold 4.9.0?

Hi,

I was looking for a sentence tokenizer that ships with Solr by default, but
could not find one. Has anyone used one, or integrated a tokenizer from
another language (Python, for example) into Solr? Please let me know.


Thanks and regards,
Sandeep

RE: Is there any sentence tokenizers in sold 4.9.0?

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
There is SmartChineseSentenceTokenizerFactory (SentenceTokenizer), which is being deprecated and replaced with HMMChineseTokenizer. I am not aware of any other sentence tokenizer, but you may either build your own, similar to SentenceTokenizer, or employ an external sentence detector/recognizer and build a Solr tokenizer on top of it.

I don't know how complex your use case is, but I would suggest looking at SentenceTokenizer and creating a similar tokenizer.
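
To make the "build your own" suggestion a little more concrete, here is a minimal sketch in Python (it is not Solr code and not something posted in this thread) of what a sentence tokenizer has to produce: each sentence together with its start and end character offsets. The function name `sentence_spans` and the regular expression are assumptions made for illustration. The rule is deliberately naive and will split wrongly on abbreviations such as "e.g.". A real Solr tokenizer would be written in Java on top of the Lucene Tokenizer API.

```python
import re

# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace
# or by the end of the input. Abbreviations ("e.g.", "Dr.") will be split
# incorrectly; a trained sentence detector handles those cases better.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)


def sentence_spans(text):
    """Return (start, end, sentence) triples for each sentence found in text."""
    spans = []
    for match in _SENTENCE.finditer(text):
        # Offsets refer to positions in the original text, which is what an
        # indexing-side tokenizer would need to report for each token.
        spans.append((match.start(), match.end(), match.group()))
    return spans
```

Calling `sentence_spans` on a short paragraph returns one triple per sentence, with leading whitespace between sentences skipped.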

Thanks,
Susheel

-----Original Message-----
From: Sandeep B A [mailto:belgavi.sandeep@gmail.com]
Sent: Friday, September 05, 2014 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any sentence tokenizers in sold 4.9.0?

Sorry for the typo: it is Solr 4.9.0, not sold 4.9.0.

This e-mail message may contain confidential or legally privileged information and is intended only for the use of the intended recipient(s). Any unauthorized disclosure, dissemination, distribution, copying or the taking of any action in reliance on the information herein is prohibited. E-mails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, or contain viruses. Anyone who communicates with us by e-mail is deemed to have accepted these risks. The Digital Group is not responsible for errors or omissions in this message and denies any responsibility for any damage arising from the use of e-mail. Any opinion defamatory or deemed to be defamatory or  any material which could be reasonably branded to be a species of plagiarism and other statements contained in this message and any attachment are solely those of the author and do not necessarily represent those of the company.

RE: Is there any sentence tokenizers in sold 4.9.0?

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
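Just as an FYI, you may want to try the sentence detection tokenizer from the OpenNLP integration proposed for Solr. As far as I know, it is available as a patch on the issue below rather than as part of the stock 4.9 release.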

https://issues.apache.org/jira/browse/LUCENE-2899

-----Original Message-----
From: Susheel Kumar [mailto:susheel.kumar@thedigitalgroup.net]
Sent: Monday, September 08, 2014 8:29 PM
To: solr-user@lucene.apache.org
Subject: RE: Is there any sentence tokenizers in sold 4.9.0?

Sandeep,

As Jack mentioned, it would be useful to know the use case and what kind of queries you will be executing, as you may also need to handle this on the query side, not just on the indexing side. For integrating with NLTK there are different options, such as calling NLTK out of process or using jythonc to generate Java classes.

Thnx


RE: Is there any sentence tokenizers in sold 4.9.0?

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Sandeep,

As Jack mentioned, it would be useful to know the use case and what kind of queries you will be executing, as you may also need to handle this on the query side, not just on the indexing side. For integrating with NLTK there are different options, such as calling NLTK out of process or using jythonc to generate Java classes.
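
A minimal sketch of the out-of-process option, assuming a simple line protocol in which the caller sends raw text and reads back one sentence per line. The names `naive_split` and `sentences_to_lines` are hypothetical, and the regex splitter is only a stand-in. Passing NLTK's `sent_tokenize` (from `nltk.tokenize`) as the `split` argument should work where NLTK and its Punkt model are installed, but that is an assumption and is not exercised here.

```python
import re


def naive_split(text):
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentences_to_lines(text, split=naive_split):
    """Format sentences one per line so a caller can read them from a pipe."""
    # Collapse internal newlines and runs of spaces so that each output line
    # holds exactly one sentence.
    lines = [" ".join(sentence.split()) for sentence in split(text)]
    return "".join(line + "\n" for line in lines)
```

A small wrapper script could then do `sys.stdout.write(sentences_to_lines(sys.stdin.read()))`, and the indexing client would start that script as a child process and split its output on newlines before sending documents to Solr.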

Thnx

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Monday, September 08, 2014 7:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any sentence tokenizers in sold 4.9.0?

Out of curiosity, what would be an example query for your application that would depend on sentence tokenization, as opposed to simple term tokenization? I mean, there are no sentence-based query operators in the Solr query parsers.

-- Jack Krupansky


Re: Is there any sentence tokenizers in sold 4.9.0?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Out of curiosity, what would be an example query for your application that 
would depend on sentence tokenization, as opposed to simple term 
tokenization? I mean, there are no sentence-based query operators in the 
Solr query parsers.

-- Jack Krupansky

-----Original Message----- 
From: Sandeep B A
Sent: Monday, September 8, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any sentence tokenizers in sold 4.9.0?

Hi Susheel,
Thanks for the information.
I have crawled a few websites, and all I need is a sentence tokenizer for the
data I have collected.
These websites are English only.

I don't have experience writing custom sentence tokenizers for Solr. Is
there a tutorial that explains how to do it?

Is it possible to integrate NLTK with Solr? If so, how? I ask because I
found sentence tokenizers for English in NLTK.

Thanks,
Sandeep


Re: Is there any sentence tokenizers in sold 4.9.0?

Posted by Benson Margulies <bi...@gmail.com>.
Basis Technology's toolset includes sentence boundary detectors. Please
contact me for more details.

On Fri, Sep 12, 2014 at 1:15 AM, Sandeep B A <be...@gmail.com>
wrote:

> Hi All,
> Sorry for the delayed response.
> I was out of the office for the last few days and was not able to reply.
> Thanks for the information.
>
> We have a use case where one sentence is the unit token, on which we need
> to do normalization and semantic analysis.
>
> We still need to decide on the type of normalizer and analyzer, but I was
> trying to see whether Solr has any built-in libraries, so that no
> cross-language integration is required.
>
> I will get back to you on whether it works or not.
>
> @Susheel,
> Thanks, I will try to see if that works.
>
> Thanks,
> Sandeep.

Re: Is there any sentence tokenizers in sold 4.9.0?

Posted by Aman Tandon <am...@gmail.com>.
Hi,

Is there a semantic analyzer in Solr?
On Sep 12, 2014 10:51 AM, "Sandeep B A" <be...@gmail.com> wrote:

> Hi All,
> Sorry for the delayed response.
> I was out of the office for the last few days and was not able to reply.
> Thanks for the information.
>
> We have a use case where one sentence is the unit token, on which we need
> to do normalization and semantic analysis.
>
> We still need to decide on the type of normalizer and analyzer, but I was
> trying to see whether Solr has any built-in libraries, so that no
> cross-language integration is required.
>
> I will get back to you on whether it works or not.
>
> @Susheel,
> Thanks, I will try to see if that works.
>
> Thanks,
> Sandeep.

Re: Is there any sentence tokenizers in sold 4.9.0?

Posted by Sandeep B A <be...@gmail.com>.
Hi All,
Sorry for the delayed response.
I was out of the office for the last few days and was not able to reply.
Thanks for the information.

We have a use case where one sentence is the unit token, on which we need
to do normalization and semantic analysis.

We still need to decide on the type of normalizer and analyzer, but I was
trying to see whether Solr has any built-in libraries, so that no
cross-language integration is required.

I will get back to you on whether it works or not.

@Susheel,
Thanks, I will try to see if that works.

Thanks,
Sandeep.
On Sep 8, 2014 12:54 PM, "Sandeep B A" <be...@gmail.com> wrote:

> Hi Susheel,
> Thanks for the information.
> I have crawled a few websites, and all I need is a sentence tokenizer for
> the data I have collected.
> These websites are English only.
>
> I don't have experience writing custom sentence tokenizers for Solr. Is
> there a tutorial that explains how to do it?
>
> Is it possible to integrate NLTK with Solr? If so, how? I ask because I
> found sentence tokenizers for English in NLTK.
>
> Thanks,
> Sandeep

Re: Is there any sentence tokenizers in sold 4.9.0?

Posted by Sandeep B A <be...@gmail.com>.
Hi Susheel,
Thanks for the information.
I have crawled a few websites, and all I need is a sentence tokenizer for the
data I have collected.
These websites are English only.

I don't have experience writing custom sentence tokenizers for Solr. Is
there a tutorial that explains how to do it?

Is it possible to integrate NLTK with Solr? If so, how? I ask because I
found sentence tokenizers for English in NLTK.

Thanks,
Sandeep
On Sep 5, 2014 8:10 PM, "Sandeep B A" <be...@gmail.com> wrote:

> Sorry for the typo: it is Solr 4.9.0, not sold 4.9.0.

Re: Is there any sentence tokenizers in sold 4.9.0?

Posted by Sandeep B A <be...@gmail.com>.
Sorry for the typo: it is Solr 4.9.0, not sold 4.9.0.
 On Sep 5, 2014 7:48 PM, "Sandeep B A" <be...@gmail.com> wrote:

> Hi,
>
> I was looking for a sentence tokenizer that ships with Solr by default, but
> could not find one. Has anyone used one, or integrated a tokenizer from
> another language (Python, for example) into Solr? Please let me know.
>
>
> Thanks and regards,
> Sandeep
>