Posted to users@spamassassin.apache.org by Loren Wilton <lw...@earthlink.net> on 2005/01/17 05:39:04 UTC

OT: word frequency analysis

I'm not a unix type, so how to do this isn't obvious to me, but it is
probably trivial.

Given a file with a few paragraphs of words (multiple words per line,
obviously) I want to generate a list of the individual words in descending
order of occurrence frequency.  I'd like the frequency number with each word
too.

Can anyone give me a simple command line incantation to do that?

Thanks,
        Loren


Re: OT: word frequency analysis

Posted by Thomas Arend <ml...@arend-whv.info>.

Am Montag, 17. Januar 2005 06:34 schrieb Loren Wilton:
> > Probably want to nuke punctuation and capitalization before doing
> > the sort.  I'm too braindead at the moment, but some perl incantation
> > might be the way to go, or if you're old school then awk would probably
> > work.
>
> Yea, that occurred to me.  Since I was pasting a spam into a text file
> anyway I just did that manually, since I couldn't seem to figure out sed's
> re grouping syntax to treat the punctuation like spaces.
>
>         Loren

For capitalization use "tr '[:lower:]' '[:upper:]'" or 
to delete chars use "tr -d 'chartodelet..'"
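
The two tr steps chain together; here is a sketch assuming POSIX tr and a
hypothetical file input.txt (Thomas folds lower-to-upper, but either
direction normalizes case — lowercasing is shown here):

```shell
# Combine the two tr steps: fold everything to lower case, then
# strip punctuation, so "Hello," and "hello" count as the same word.
tr '[:upper:]' '[:lower:]' < input.txt | tr -d '[:punct:]'
```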

Thomas

-- 
icq:133073900
http://www.t-arend.de

Re: OT: word frequency analysis

Posted by Loren Wilton <lw...@earthlink.net>.
> Probably want to nuke punctuation and capitalization before doing
> the sort.  I'm too braindead at the moment, but some perl incantation
> might be the way to go, or if you're old school then awk would probably
> work.

Yea, that occurred to me.  Since I was pasting a spam into a text file
anyway I just did that manually, since I couldn't seem to figure out sed's
re grouping syntax to treat the punctuation like spaces.

        Loren
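
For what Loren was attempting, no grouping syntax is needed at all — a
POSIX character class inside the pattern does the job. A sketch, with
spam.txt standing in for whatever file the spam was pasted into:

```shell
# Replace every punctuation mark with a space, so punctuation-glued
# words split apart before counting.
sed -e 's/[[:punct:]]/ /g' spam.txt
```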


Re: OT: word frequency analysis

Posted by Steve Prior <sp...@geekster.com>.
Probably want to nuke punctuation and capitalization before doing
the sort.  I'm too braindead at the moment, but some perl incantation
might be the way to go, or if you're old school then awk would probably
work.

Steve
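
One guess at the awk incantation Steve has in mind (input.txt is a
stand-in name; this is a sketch, not anything Steve posted):

```shell
# Fold case, turn punctuation into spaces, count every remaining
# field, then let sort order the counts descending.
awk '{ gsub(/[[:punct:]]/, " "); $0 = tolower($0)
       for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print count[w], w }' input.txt | sort -rn
```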

Rich Puhek wrote:

> Loren Wilton wrote:
> 
>> I'm not a unix type, so how to do this isn't obvious to me, but it is
>> probably trivial.
>>
>> Given a file with a few paragraphs of words (multiple words per line,
>> obviously) I want to generate a list of the individual words in
>> descending order of occurrence frequency.  I'd like the frequency
>> number with each word too.
>>
>> Can anyone give me a simple command line incantation to do that?
>>
>> Thanks,
>>         Loren
>>
> 
> Something like the following should do what you want:
> 
> sed -e 's/ /\n/g' <inputfile> | sort | uniq -c | sort -rn
> 
> In English: go through <inputfile>, replacing any spaces with a newline 
> (so each word is on its own line). Send that to sort. Send the sorted 
> output to the uniq command, and have uniq count the number of 
> occurrences. Finally, send the output of uniq to the sort command, and 
> have it sort by frequency.
> 
> Not extremely trivial, but a good study in how commands like sed, uniq, 
> and sort can be pretty powerful.
> 
> 
> --Rich


Re: OT: word frequency analysis

Posted by Rich Puhek <rp...@etnsystems.com>.
Loren Wilton wrote:
> I'm not a unix type, so how to do this isn't obvious to me, but it is
> probably trivial.
> 
> Given a file with a few paragraphs of words (multiple words per line,
> obviously) I want to generate a list of the individual words in descending
> order of occurrence frequency.  I'd like the frequency number with each word
> too.
> 
> Can anyone give me a simple command line incantation to do that?
> 
> Thanks,
>         Loren
> 

Something like the following should do what you want:

sed -e 's/ /\n/g' <inputfile> | sort | uniq -c | sort -rn

In English: go through <inputfile>, replacing any spaces with a newline 
(so each word is on its own line). Send that to sort. Send the sorted 
output to the uniq command, and have uniq count the number of 
occurrences. Finally, send the output of uniq to the sort command, and 
have it sort by frequency.

Not extremely trivial, but a good study in how commands like sed, uniq, 
and sort can be pretty powerful.


--Rich
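
Rich's pipeline can also be sketched with tr doing the word splitting,
which sidesteps one portability wrinkle: a bare \n in a sed replacement
is a GNU extension, while tr handles it everywhere (words.txt is a
stand-in name):

```shell
# Split on spaces (-s squeezes runs of spaces into one newline),
# sort so duplicates are adjacent, count them, sort by count descending.
tr -s ' ' '\n' < words.txt | sort | uniq -c | sort -rn
```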