Posted to dev@forrest.apache.org by g4 <ja...@root10.net> on 2003/08/12 20:21:46 UTC
meta thoughts
Hi Jeff,
Looking into awk got the better of me, so I ended up going off at a
tangent ;)
As you said, FTR. Nonetheless, I've come up with some interesting
possibilities, so here's just to let you and the list know. Here is
the result of a quick test using the Forrest home page:
[g4:~/awk-dev] g4% /sw/bin/gawk-3.1.0 -f strip forrest.html
forrest 27
apache 14
website 7
documentation 7
document 7
content 7
project 6
write 5
dynamic 5
cocoon 5
status 4
skins 4
sites 4
Ended at Tue/Aug/2003 18:14:26
[g4:~/awk-dev] g4%
The run above is with word length set to > 4 and word frequency > 3.
[g4:~/awk-dev] g4% /sw/bin/gawk-3.1.0 -f strip forrest.html
forrest 27
apache 14
website 7
documentation 7
document 7
content 7
project 6
write 5
dynamic 5
cocoon 5
status 4
skins 4
sites 4
using 3
these 3
static 3
specific 3
software 3
rendered 3
projects 3
print 3
powerful 3
making 3
makes 3
formats 3
focus 3
Ended at Tue/Aug/2003 18:36:06
And this second run is with a word frequency > 2.
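For readers who want to try the same kind of count, here is a minimal
sketch of the idea. It is written in Python and is not the awk script
`strip` used above, which is not shown in this mail. The function name
keyword_counts, the parameters longer_than and more_than, the crude
regex tag stripping, and the sample HTML are all assumptions made for
illustration.

```python
import re
from collections import Counter


def keyword_counts(html, longer_than=4, more_than=3):
    """Return (word, count) pairs, most frequent first.

    Keeps words with len(word) > longer_than that occur more than
    `more_than` times, mirroring the two thresholds in the mail.
    """
    # Crude tag stripping: good enough for a quick test, not a real parser.
    text = re.sub(r"<[^>]*>", " ", html)
    # Treat every run of letters as a word, ignoring case.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > longer_than)
    return [(w, n) for w, n in counts.most_common() if n > more_than]


# Made-up sample input, not the real Forrest home page.
sample = "<h1>Forrest</h1><p>Forrest makes Apache docs. Apache Forrest!</p>"
for word, n in keyword_counts(sample, longer_than=4, more_than=1):
    print(word, n)
```

Words with equal counts may come out in a different order than in the
gawk output above, since this sketch does nothing special about ties.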
As you can see, there are some words that shouldn't be there (these,
makes, etc.). So I think managing keywords by frequency alone is not
really the way to go with something like this. A definitive list of
excluded words would be needed, which would also have the benefit of
being accessible and manageable. I will continue with this anyway; at
least I'm getting to know awk ;)
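The excluded-words idea could look roughly like the following. This is
again a Python sketch under stated assumptions, not an existing Forrest
or awk feature: the one-word-per-line list format with '#' comments and
the names parse_excluded and drop_excluded are hypothetical. Only the
two words named in this mail are put on the list.

```python
def parse_excluded(text):
    """Parse a stop list: one word per line, '#' starts a comment."""
    words = set()
    for line in text.splitlines():
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


def drop_excluded(counts, excluded):
    """Return a copy of a word -> count mapping without excluded words."""
    return {w: n for w, n in counts.items() if w not in excluded}


# Hypothetical list contents; in practice this would be an editable file.
stop_list = "# words that should not count as keywords\nthese\nmakes\n"

# A few of the counts from the second run above.
counts = {"forrest": 27, "apache": 14, "these": 3, "makes": 3, "static": 3}
print(drop_excluded(counts, parse_excluded(stop_list)))
```

Keeping the list as plain text, separate from the counting code, is
what would make it accessible and manageable: anyone could edit it
without touching the script.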
Jason Lane