Posted to dev@lucene.apache.org by "abhishek bafna (JIRA)" <ji...@apache.org> on 2015/03/12 07:46:38 UTC

[jira] [Issue Comment Deleted] (SOLR-7193) Concatenate words from token stream

     [ https://issues.apache.org/jira/browse/SOLR-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

abhishek bafna updated SOLR-7193:
---------------------------------
    Comment: was deleted

(was: The ConcatenateWordsFilter takes all input tokens (words) and generates a single token. CPU time and memory use depend on the number and size of the tokens in the stream. The intended use case for this filter is input that contains business names, addresses, etc., which usually have a small number of tokens. I am guessing that the input data here (the test environment) contains long paragraphs or documents, and that might be causing the issue.)

> Concatenate words from token stream
> -----------------------------------
>
>                 Key: SOLR-7193
>                 URL: https://issues.apache.org/jira/browse/SOLR-7193
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: abhishek bafna
>         Attachments: concatenate_words.patch
>
>
> User-entered data often lacks proper spacing between words, and spelling and format also vary in data such as business names, addresses, etc. After tokenizing the data, we might perform pattern replacement, stop word filtering, etc. Afterwards, we want to concatenate all the tokens and generate n-gram tokens for indexing business names and performing fuzzy matching.
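
For illustration only, the concatenate-then-n-gram idea from the description can be sketched in plain Java, without any Lucene or Solr classes. This is not the code in concatenate_words.patch; the class name ConcatNgramSketch, the methods concatenate and ngrams, and the sample tokens are all hypothetical, and the tokens are assumed to have already passed through tokenizing, pattern replacement, and stop word filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plain Java, no Lucene/Solr APIs involved.
final class ConcatNgramSketch {

    // Join every token into one string with no separator.
    // Work and memory grow with the number and total size of the tokens.
    static String concatenate(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token);
        }
        return sb.toString();
    }

    // Emit all character n-grams of length minGram..maxGram,
    // ordered by start offset and then by length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Assumed sample input: tokens left after upstream filtering.
        List<String> tokens = Arrays.asList("joes", "pizza");
        String joined = concatenate(tokens);
        System.out.println(joined);
        System.out.println(ngrams(joined, 3, 3));
    }
}
```

Because the single concatenated token is as long as all the input tokens combined, and the number of n-grams grows with that length, a sketch like this only stays cheap for short inputs such as names and addresses.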



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org