Posted to user@mahout.apache.org by Marco <ze...@yahoo.co.uk> on 2013/08/06 12:05:35 UTC

Vectors (from raw text) with more than one word values

Is it possible to have vector components from raw text samples with more than one word?

Example:

Key: California, Value: "Arnold Schwarzenegger" "San Andreas Fault"

(I've put quotation marks just to show how I'd like to group the vector's values)

Re: Vectors (from raw text) with more than one word values

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Aug 6, 2013 at 7:18 AM, Suneel Marthi <su...@yahoo.com> wrote:

> Marco,
>
> The actual tokenization is done by the Lucene Analyzer you specify with
> the option "--analyzerName" (the default being Lucene's StandardAnalyzer)
> while invoking seq2sparse.
>
> Off the top of my head, I don't think there is a custom Lucene tokenizer
> for "quoted" text, but it should be really easy to create one.
>

+1 to Suneel's remarks.

I'm not aware of an off-the-shelf tokenizer which looks for "grouping"
tokens and automatically crams them together.  That would be very
helpful, however, as then you could run e.g. Stanford's named entity
recognizer (or LingPipe, etc.) over the text first and have it annotate
things like noun phrases beforehand, using your predetermined grouping
tokens.

Alternatively, the "hacky" solution would be to force the typical Lucene
tokenizer to leave things alone by gluing them together: in your
preprocessing code which decides that "New York Yankees" is a trigram you
don't want split apart, just have it replace this string with
NewYorkYankees, and most Lucene tokenizers will leave it alone.  For
seq2sparse, you'll get exactly the vectors you want, but your dictionary
will contain these "lame" concatenated bigrams and so forth, which you
could write some simple code to unscramble after the fact.
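
For instance, the preprocessing could be as dumb as a fixed phrase table
(a rough, untested sketch - the class name and the phrase list are made up
for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

public class PhraseGlue {
  // Known multi-word phrases, mapped to single tokens that a standard
  // Lucene tokenizer will not split apart.
  private static final Map<String, String> PHRASES = new LinkedHashMap<String, String>();
  static {
    PHRASES.put("New York Yankees", "NewYorkYankees");
    PHRASES.put("Arnold Schwarzenegger", "ArnoldSchwarzenegger");
    PHRASES.put("San Andreas Fault", "SanAndreasFault");
  }

  // Run over the raw text before writing the sequence files.
  public static String glue(String text) {
    for (Map.Entry<String, String> e : PHRASES.entrySet()) {
      text = text.replace(e.getKey(), e.getValue());
    }
    return text;
  }

  // Run over dictionary entries after seq2sparse to recover the original phrase.
  public static String unglue(String token) {
    for (Map.Entry<String, String> e : PHRASES.entrySet()) {
      if (e.getValue().equals(token)) {
        return e.getKey();
      }
    }
    return token;
  }
}

Gluing before vectorization and un-gluing the dictionary afterwards keeps
the rest of the pipeline untouched.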


-- 

  -jake

Re: Vectors (from raw text) with more than one word values

Posted by Suneel Marthi <su...@yahoo.com>.
Marco,

The actual tokenization is done by the Lucene Analyzer you specify with the option "--analyzerName" (the default being Lucene's StandardAnalyzer) while invoking seq2sparse.

Off the top of my head, I don't think there is a custom Lucene tokenizer for "quoted" text, but it should be really easy to create one.
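
Something along these lines might work as a starting point (an untested
sketch against the Lucene 4.x API that recent Mahout bundles; the class
name is made up, and the Version constant depends on the Lucene jar on
your classpath):

import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.util.Version;

// Treats "double quoted" spans as single tokens; everything else is split
// on whitespace.
public final class QuotedPhraseAnalyzer extends Analyzer {
  private static final Pattern TOKEN = Pattern.compile("\"[^\"]+\"|\\S+");

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // group 0 = emit the whole regex match as one token
    Tokenizer source = new PatternTokenizer(reader, TOKEN, 0);
    return new TokenStreamComponents(source, new LowerCaseFilter(Version.LUCENE_46, source));
  }
}

You would then pass its fully qualified name via --analyzerName; as far as I
know the class just needs a public no-arg constructor and has to be on the
job's classpath. Note that the surrounding quotes stay inside the token
unless you strip them with an extra filter.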






Re: Vectors (from raw text) with more than one word values

Posted by Marco <ze...@yahoo.co.uk>.
Wow, this is a hell of an answer!
Thanks very much for it.

I also thought about something else: since I'm the one producing the sequence files that'll then be "seq2sparsed", I figured I could "wrap" my n-grams (say, using quotation marks or whatever) so that seq2sparse would not break them into smaller pieces.

Any chance this is possible? Does it depend on the separator I use?
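
To make the idea concrete: the input to seq2sparse is just a SequenceFile of
Text key/value pairs, so the quotation-mark wrapping would happen when the
value text is written - roughly like this (an untested sketch using the plain
Hadoop API; the path is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteWrappedDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("docs/part-00000");
    // key = document id, value = raw text handed to the analyzer by seq2sparse
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
    try {
      writer.append(new Text("California"),
                    new Text("\"Arnold Schwarzenegger\" \"San Andreas Fault\""));
    } finally {
      writer.close();
    }
  }
}

Whether the quotes survive then depends on the analyzer seq2sparse runs over
the value text, not on any separator in the file itself.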





Re: Vectors (from raw text) with more than one word values

Posted by Jake Mannix <ja...@gmail.com>.
Indeed, and our seq2sparse utility enables this directly.  If you ask for
cmdline help from seq2sparse, you'll see a bunch of options you may not be
using:

$ ./bin/mahout seq2sparse -h
Error: Could not find or load main class classpath
Running on hadoop, using /usr/local/Cellar/hadoop/0.20.1/libexec/bin/hadoop
and HADOOP_CONF_DIR=
MAHOUT-JOB:
/Users/jake/open_src/gitrepo/mahout-twitter/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Usage:

 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
  <chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
  <maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
  --minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
  --overwrite --help --sequentialAccessVector --namedVector --logNormalize]

Options

  --minSupport (-s) minSupport        (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors to
                                      be used, expressed in times the standard
                                      deviation (sigma) of the document
                                      frequencies of these vectors. Can be used
                                      to remove really high frequency terms.
                                      Expressed as a double value. Good value to
                                      be specified is 3.0. In case the value is
                                      less than 0, no vectors will be filtered
                                      out. Default is -1.0. Overrides maxDFPercent.
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF. Can
                                      be used to remove really high frequency
                                      terms. Expressed as an integer between 0
                                      and 100. Default is 99. If maxDFSigma is
                                      also set, it will override this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF or
                                      TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm. Must be greater or equal to
                                      0. The default is not to normalize.
  --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
                                      Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks. Default
                                      Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc).
                                      Default Value: 1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should be
                                      SequentialAccessVectors. If set true else
                                      false
  --namedVector (-nv)                 (Optional) Whether output vectors should be
                                      NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should be
                                      logNormalize. If set true else false

13/08/06 04:44:45 INFO driver.MahoutDriver: Program took 158 ms (Minutes: 0.0026333333333333334)

--------

In particular, "--maxNGramSize 3" says you don't want just raw tokens, but
bigrams like "Arnold Schwarzenegger" and trigrams like "New York Yankees".
To decide *which* ones to use (because there are *way* too many 2- and
3-grams if you take all of them), the simple technique we have in this
utility is to a) filter by document frequency, either on the high end: get
rid of features [either tokens/unigrams or ngrams for n > 1] which occur
too frequently, by setting --maxDFPercent 95 [this drops the 5% of most
commonly occurring features] or --maxDFSigma 3.0 [this drops all tokens
with doc frequency > 3 sigma higher than the mean], or by getting rid of
features which occur too rarely, with --minDF 2: this makes sure that
features which occur fewer than 2 times get dropped; and b) filter by log
likelihood ratio: --minLLR 10.0 sets the minimum LLR for an ngram to be at
least 10.0.  For a more detailed explanation of ngrams and LLR, Ted's
classic blog post
<http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> may
be helpful.
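
If you want to get a feel for the numbers, the mahout-math module has a
LogLikelihood class which (I believe) computes the same statistic the
collocation pass uses; a toy check with invented counts would be something
like:

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrSanityCheck {
  public static void main(String[] args) {
    // Invented 2x2 contingency counts for a candidate bigram "new york":
    // k11 = times the two words occur together, k12 = "new" without "york",
    // k21 = "york" without "new", k22 = everything else.
    long k11 = 110, k12 = 2442, k21 = 950, k22 = 100000;
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("LLR = " + llr);  // compare against your --minLLR cutoff
  }
}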

  The TL;DR of it is that for practical purposes, you really want to try a
simple run with say "--minLLR 1.0", see how many ngrams are left, how good
they look, and what their LLR is, and then bump the value up to something
which gets rid of more of the crappy ones - maybe it's 10.0, maybe it's
15.0, maybe it's 25.0; it depends on your data and on how many features you
want to end up with at the end of the day.
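
For example, a first pass over a corpus of sequence files might look like
this (the paths are placeholders):

$ ./bin/mahout seq2sparse \
    -i /path/to/seqfiles \
    -o /path/to/vectors \
    --maxNGramSize 3 \
    --maxDFPercent 95 \
    --minDF 2 \
    --minLLR 10.0 \
    --namedVector \
    --overwrite

and then you inspect the resulting dictionary before tightening --minLLR.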



On Tue, Aug 6, 2013 at 3:05 AM, Marco <ze...@yahoo.co.uk> wrote:

> Is it possible to have vector components from raw text samples with more
> than one word?
>
> Example:
>
> Key: California, Value: "Arnold Schwarzenegger" "San Andreas Fault"
>
> (I've put quotation marks just to show how I'd like to group the vector's
> values)
>



-- 

  -jake