Posted to users@opennlp.apache.org by Sahar Ebadi <sa...@yuxipacific.com> on 2013/04/30 15:43:07 UTC

valid sentence detector

Hi all,

Let's say I have a text and I would like to detect only "good sentences". By
"good sentences" I mean sentences that 1) are grammatically complete,
2) have meaning, and 3) are in English.

As far as I can tell, the OpenNLP sentence detector only detects sentences
according to punctuation (and a list of acronyms it has), so there is
no guarantee that the sentences are real, complete, and meaningful.

Now my question is: is there any process in NLP that can help me to

1) find grammatically complete sentences?
2) determine whether a sentence has meaning or not?
3) filter out non-English texts?

Any suggestions or useful resources would be highly appreciated!

Thanks.

Re: valid sentence detector

Posted by Ryan Josal <rj...@gmail.com>.

Sahar,

You could also try weeding out sentences to which the sentence detector assigns low probabilities.  For the language part, are there multiple languages in one chunk of text?  If not, you could use Tika or Google LangDetect to detect the language.
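
A minimal sketch of the probability filter, assuming the detector exposes one probability per detected sentence (in OpenNLP, SentenceDetectorME.getSentenceProbabilities() returns these for the most recent sentDetect() call). The sentences, the scores, and the 0.9 threshold below are made up for illustration, and keep_confident is a hypothetical helper, not part of any library.

```python
from typing import List


def keep_confident(sentences: List[str],
                   probabilities: List[float],
                   threshold: float = 0.9) -> List[str]:
    """Keep only sentences whose boundary probability meets the threshold."""
    if len(sentences) != len(probabilities):
        raise ValueError("need exactly one probability per sentence")
    return [s for s, p in zip(sentences, probabilities) if p >= threshold]


# Hypothetical detector output: sentence text plus its boundary probability.
sentences = ["The parser produced a tree.", "and then e.g.", "It rained all day."]
probabilities = [0.98, 0.41, 0.95]

confident = keep_confident(sentences, probabilities, threshold=0.9)
print(confident)
```

A low probability only says the detector was unsure about the boundary, so this is a coarse pre-filter rather than a grammaticality test.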

Ryan

On Apr 30, 2013, at 6:58, William Colen <wi...@gmail.com> wrote:

> Hi, Sahar,
> 
> I don't know of an established approach that solves your problem, but there are
> a few things you could try. For example, you could check whether the sentence is
> parseable. If a parser can produce a tree for the sentence, it might
> mean that its structure is known. I don't know if it would work with a
> statistical parser like the one in OpenNLP, but it works at least for
> rule-based parsers, where you have fine-grained control over the structures.
> 
> Regards,
> William

Re: valid sentence detector

Posted by William Colen <wi...@gmail.com>.

Hi, Sahar,

I don't know of an established approach that solves your problem, but there are
a few things you could try. For example, you could check whether the sentence is
parseable. If a parser can produce a tree for the sentence, it might
mean that its structure is known. I don't know if it would work with a
statistical parser like the one in OpenNLP, but it works at least for
rule-based parsers, where you have fine-grained control over the structures.
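
As a rough illustration of the parseability idea with a rule-based grammar: the sketch below runs a CYK-style recognizer over Penn Treebank part-of-speech tags and accepts a tag sequence only if the toy grammar can derive an S for it. The grammar, the UNARY and BINARY_RULES tables, and is_parseable are all hypothetical and far too small for real text. The tags are assumed to come from a separate POS tagger, with final punctuation already removed.

```python
from itertools import product

# Hypothetical toy grammar over Penn Treebank POS tags (illustration only).
# A pair of adjacent constituents maps to the phrase they form.
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("DT", "NN"): "NP",
    ("VBZ", "NP"): "VP",
    ("VBD", "NP"): "VP",
}
# Single tags that may stand alone as a phrase.
UNARY = {
    "PRP": {"NP"},
    "NNP": {"NP"},
    "NNS": {"NP"},
    "VBZ": {"VP"},
    "VBD": {"VP"},
}


def is_parseable(tags):
    """Return True if the toy grammar derives S over the whole tag sequence."""
    n = len(tags)
    if n == 0:
        return False
    # chart[i][j] holds every label that can span tags[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):
        chart[i][i + 1] = {tag} | UNARY.get(tag, set())
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left, right in product(chart[start][mid], chart[mid][end]):
                    parent = BINARY_RULES.get((left, right))
                    if parent:
                        chart[start][end].add(parent)
    return "S" in chart[0][n]


print(is_parseable(["DT", "NN", "VBZ", "DT", "NN"]))  # e.g. "the dog chases the cat"
print(is_parseable(["DT", "NN"]))                      # e.g. "the dog"
```

Because every rule is written by hand, a sequence the grammar does not cover is rejected outright, which is the fine-grained control mentioned above; it also means any construction missing from the grammar is treated as invalid.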

Regards,
William

Re: valid sentence detector

Posted by James Kosin <ja...@gmail.com>.

Hi Sahar,

The only problem with a rule-based parser is that there are always
exceptions to the rules.

Regards,
James

On 5/6/2013 8:36 AM, William Colen wrote:
> Hi, Sahar,
>
> I don't know of any open-source rule-based parser, but there probably is one.
>
> I only know of a proprietary one called EngGram, which is built with
> Constraint Grammar (GPL).
> You can use EngGram online here:
> http://beta.visl.sdu.dk/visl/en/parsing/automatic/
>
> Regards,
> William

Re: valid sentence detector

Posted by William Colen <wi...@gmail.com>.

Hi, Sahar,

I don't know of any open-source rule-based parser, but there probably is one.

I only know of a proprietary one called EngGram, which is built with
Constraint Grammar (GPL).
You can use EngGram online here:
http://beta.visl.sdu.dk/visl/en/parsing/automatic/

Regards,
William


On Fri, May 3, 2013 at 12:06 PM, Sahar Ebadi <sa...@yuxipacific.com> wrote:

> Hi,
> Thanks for the replies! :)
>
> Lance:
> 1) Yes, I use the sentence detector just to split the text into sentences; I
> am not treating them as valid sentences.
> 2) Watson goes beyond what I need. I only need to find good/valid sentences
> in the text (NLP only, without the reasoning and information retrieval
> that Watson does).
> 3) I know there should be some semi-effective solutions, but I am not able to
> find them. Can you give me some keywords or a short explanation of some of
> them? That would be a great help!
>
> So, what I have done:
> The only solution I found was to parse the sentence and then check whether
> it follows the standard grammatical pattern of a sentence. If so, it is a
> valid sentence; otherwise it is not. So far, I have parsed
> the sentences using OpenNLP, whose output is tagged according to the Penn
> Treebank. Now I need to know whether there is any standard sentence pattern
> based on the Penn Treebank.
>
> Ryan: the result will not be accurate enough.
>
> William: can you give me the names of some rule-based parsers you have in
> mind (especially ones compatible with OpenNLP)?
>
> I really appreciate any suggestions on this.
>
> Thank you all so much!

Re: valid sentence detector

Posted by Sahar Ebadi <sa...@yuxipacific.com>.

Hi,
Thanks for the replies! :)

Lance:
1) Yes, I use the sentence detector just to split the text into sentences; I
am not treating them as valid sentences.
2) Watson goes beyond what I need. I only need to find good/valid sentences
in the text (NLP only, without the reasoning and information retrieval
that Watson does).
3) I know there should be some semi-effective solutions, but I am not able to
find them. Can you give me some keywords or a short explanation of some of
them? That would be a great help!

So, what I have done:
The only solution I found was to parse the sentence and then check whether
it follows the standard grammatical pattern of a sentence. If so, it is a
valid sentence; otherwise it is not. So far, I have parsed
the sentences using OpenNLP, whose output is tagged according to the Penn
Treebank. Now I need to know whether there is any standard sentence pattern
based on the Penn Treebank.
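
One possible reading of "standard sentence pattern" is a clause node with both a subject noun phrase and a verb phrase. The sketch below, assuming the parser emits Penn Treebank style bracketed strings wrapped in a TOP node, checks for an S with direct NP and VP children. parse_tree and looks_like_full_sentence are hypothetical helpers, the example parses are hand-written rather than real parser output, and the rule deliberately rejects imperatives and other valid sentences that have no explicit subject.

```python
def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def looks_like_full_sentence(tree):
    """Heuristic: the clause is an S with direct NP and VP children."""
    label, children = tree
    # Unwrap a single-child TOP/ROOT wrapper if present (assumed convention).
    if label in ("TOP", "ROOT") and len(children) == 1 and isinstance(children[0], tuple):
        label, children = children[0]
    if label != "S":
        return False
    child_labels = [c[0] for c in children if isinstance(c, tuple)]
    return "NP" in child_labels and "VP" in child_labels


full = "(TOP (S (NP (PRP I)) (VP (VBP like) (NP (NN tea))) (. .)))"
fragment = "(TOP (NP (DT the) (JJ big) (NN dog)))"

print(looks_like_full_sentence(parse_tree(full)))
print(looks_like_full_sentence(parse_tree(fragment)))
```

A statistical parser will usually return some tree for any input, so the useful signal here is which label it assigns at the top (S versus NP, FRAG, and so on), not whether a tree exists.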

Ryan: the result will not be accurate enough.

William: can you give me the names of some rule-based parsers you have in
mind (especially ones compatible with OpenNLP)?

I really appreciate any suggestions on this.

Thank you all so much!

On Wed, May 1, 2013 at 5:34 PM, Lance Norskog <go...@gmail.com> wrote:

> The "sentence detector" is for segmentation (breaking text into sentences), not
> analysis.
>
> The 'brute force' approach for removing non-English texts is to search for
> characters in the higher Unicode pages. If a code point is over 255, it's
> not English. (Except maybe for currency.)
>
> What you're talking about are semantically deep problems that have a lot
> of semi-effective solutions. How deep do you want this analysis to be? How
> close to IBM Watson do you expect to get?

Re: valid sentence detector

Posted by Lance Norskog <go...@gmail.com>.

The "sentence detector" is for segmentation (breaking text into sentences),
not analysis.

The 'brute force' approach for removing non-English texts is to search
for characters in the higher Unicode pages. If a code point is over 255,
it's not English. (Except maybe for currency.)
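
A minimal sketch of that brute-force check, assuming a small tolerance is wanted for currency symbols and typographic quotes, which also sit above code point 255. The function names and the 5% cutoff are hypothetical choices for illustration.

```python
def fraction_above_latin1(text):
    """Fraction of non-whitespace characters with a code point above 255."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) > 255 for c in chars) / len(chars)


def probably_not_english(text, max_fraction=0.05):
    """Flag text whose share of high code points exceeds max_fraction."""
    return fraction_above_latin1(text) > max_fraction


print(probably_not_english("The ticket costs €5 at the door."))
print(probably_not_english("これは日本語の文です。"))
print(probably_not_english("Bonjour, ça va très bien."))
```

The last call shows the limit of the approach: French, German, Spanish, and other languages written mostly within Latin-1 pass this check, so it can only rule out scripts outside that range.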

What you're talking about are semantically deep problems that have a lot 
of semi-effective solutions. How deep do you want this analysis to be? 
How close to IBM Watson do you expect to get?
