Posted to solr-user@lucene.apache.org by Erik Hatcher <er...@gmail.com> on 2010/03/24 15:40:33 UTC

wikipedia and teaching kids search engines

I've got a couple of questions for the community...

   * what's the simplest way to get Solr up and running with a  
relatively richly schema'd index of a Wikipedia dump?

What I'm looking for is something as easy as this:

   java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar

   cat wikipedia.bz2 | wikipedia_solr_indexer

My goal is to index wikipedia in order to demonstrate search to a  
class of middle school kids that I've volunteered to teach for a  
couple of hours.  Which brings me to my next question...

  * anyone have ideas on some basic hands-on ways of teaching search  
engine fundamentals?

One idea I have is to bring some actual "documents", say a poster
board with a sentence written on it in large letters, and have the
students physically *tokenize* the document by cutting it up and
building the term dictionary in lexicographic order.  Thoughts on
taking it further welcome!
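
For anyone who wants a software counterpart to the scissors exercise, here is a minimal sketch in Python. The function names and the tokenizing rule (lowercase, keep letters, digits, and apostrophes) are assumptions made for illustration; they are not what Solr or Lucene does internally.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_dictionary(documents):
    """Return the distinct terms across all documents, in sorted order."""
    terms = set()
    for text in documents:
        terms.update(tokenize(text))
    return sorted(terms)

poster = "The quick brown fox jumps over the lazy dog"
print(tokenize(poster))           # the cut-up pieces, in document order
print(term_dictionary([poster]))  # the pieces de-duplicated and sorted
```

Note that "the" appears twice in the token list but only once in the dictionary, which is the same thing the kids would discover when sorting their paper slips.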

Thanks all.

	Erik


Re: wikipedia and teaching kids search engines

Posted by Walter Underwood <wu...@wunderwood.org>.
This is brilliant. I love it!

Is a computer game a document? How about each level, each room, each player?

If you want some fancy linguistics besides stemming, try compounding, or what I call "one word or two?" English loves to glom words together.

schoolroom or school room?
babysitter, baby-sitter, or baby sitter?
Ghost Busters or Ghostbusters? 

Note: the poster and movie titles for Ghostbusters disagree; I have screenshots of that.
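
One simple way to show the "one word or two?" problem in code is to compare spellings after removing spaces and hyphens. This is only a sketch under that assumption; `compound_key` is a hypothetical helper, and real engines handle compounds in more careful ways.

```python
import re

def compound_key(phrase):
    """Collapse case, spaces, and hyphens so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", phrase.lower())

variants = ["babysitter", "baby-sitter", "baby sitter"]
print({compound_key(v) for v in variants})
```

All three variants collapse to a single key, so a search for any one of them can find the others. The trade-off is that a query for "baby" alone no longer matches unless the separate words are indexed as well.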

wunder

On Mar 24, 2010, at 9:53 AM, Erick Erickson wrote:

> Erik:
> 
> In a former incarnation, I thought I was going to teach 6th graders. Until I
> found out I can't deal with 25 kids for 6 hours at a stretch for years on
> end....
> 
> My thoughts, presented in a "feel free to ignore but this is what I'd do"
> spirit.
> There are some random thoughts below, but here's what I'd think about...
> 
> Do a bit of an intro to the game. 10 minutes tops.
> 
> Make a game of sorts out of it. Some teams are the "indexers" and some are
> the "searchers". Give them some simple rules to follow, perhaps different
> ones for different pairs. Make sure some get surprising results (e.g. have
> one indexing team stem, the paired search team not stem). The searchers
> should rank the documents; you'll get some really surprising results.
> Emphasize that the game isn't pass/fail, it's to show the kinds of things we
> have to deal with.
> 
> Find some random near age-mates and try it once or twice before you present;
> you'll undoubtedly change something. Maybe run it by a teacher or two.
> 
> Use that as a basis to discuss the fact that people who write the programs
> that index/search have to cope with all the stuff they did, and the rules
> are imperfect. And each decision is made to serve a need, and when the user
> needs something *else*, it probably isn't a good match. And how horrible
> things happen when one part of the team assumes something different than the
> other part. And how end users don't care about all the internal stuff, they
> just care about how well their needs were served....
> 
> ***Here are my random musings; they may even be useful***
> Outline what you want to cover. Then cut out 75% of it. Really. Forget
> running Solr; the kids don't care. Think about questions like "what's a
> word?" "How is a stupid computer going to figure out what *you* want?"
> "what's a document?"
> 
> Certainly do the exercise of presenting sentences and asking what they'd
> expect, e.g.
> "The dog is running": would you expect "run" to be a hit? "ran"? "the"? You can
> work tokenizing in here, perhaps under the guise of "what's important when
> searching?" Maybe even before the game above if you decide to do that.
> 
> Why or why not? Perhaps ask/talk about how a really stupid computer program
> is supposed to figure stuff like this out.
> 
> Back up and tell them what a document is. How hard that is to define. Chris
> M. is right on when he talks about hooking what they're interested in.
> 
> Maybe come up with some examples of really surprising results from searches,
> and do a really *simple* explanation of how it got that way.
> 
> If you decide to go into scoring, stick with simplicity. Like "the more
> times a word appears in a document, the more relevant it is". Can you even
> guarantee that they'd understand phrasing this in terms of percentages?
> 
> FWIW
> Erick
> 

Re: wikipedia and teaching kids search engines

Posted by Erick Erickson <er...@gmail.com>.
Erik:

In a former incarnation, I thought I was going to teach 6th graders. Until I
found out I can't deal with 25 kids for 6 hours at a stretch for years on
end....

My thoughts, presented in a "feel free to ignore but this is what I'd do"
spirit.
There are some random thoughts below, but here's what I'd think about...

Do a bit of an intro to the game. 10 minutes tops.

Make a game of sorts out of it. Some teams are the "indexers" and some are
the "searchers". Give them some simple rules to follow, perhaps different
ones for different pairs. Make sure some get surprising results (e.g. have
one indexing team stem, the paired search team not stem). The searchers
should rank the documents; you'll get some really surprising results.
Emphasize that the game isn't pass/fail, it's to show the kinds of things we
have to deal with.
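
The "one team stems, the other doesn't" surprise can also be reproduced in a few lines of Python. This is a sketch only: `crude_stem` is a deliberately tiny, hypothetical rule set (not the Porter stemmer or anything Solr ships), and the two documents are made up.

```python
def crude_stem(token):
    """Strip a few common English suffixes. A toy rule set, not a real stemmer."""
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs, analyze):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(analyze(token), set()).add(doc_id)
    return index

def search(index, query, analyze):
    """Look up a single-word query after applying the searcher's analysis."""
    return index.get(analyze(query.lower()), set())

docs = {1: "the dog is running", 2: "dogs run fast"}
index = build_index(docs, crude_stem)         # the indexing team stems

print(search(index, "running", crude_stem))   # searcher follows the same rules
print(search(index, "running", lambda t: t))  # searcher does not stem
```

With matching rules the query finds both documents; without stemming on the search side it finds nothing, even though document 1 literally contains "running".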

Find some random near age-mates and try it once or twice before you present;
you'll undoubtedly change something. Maybe run it by a teacher or two.

Use that as a basis to discuss the fact that people who write the programs
that index/search have to cope with all the stuff they did, and the rules
are imperfect. And each decision is made to serve a need, and when the user
needs something *else*, it probably isn't a good match. And how horrible
things happen when one part of the team assumes something different than the
other part. And how end users don't care about all the internal stuff, they
just care about how well their needs were served....

***Here are my random musings; they may even be useful***
Outline what you want to cover. Then cut out 75% of it. Really. Forget
running Solr; the kids don't care. Think about questions like "what's a
word?" "How is a stupid computer going to figure out what *you* want?"
"what's a document?"

Certainly do the exercise of presenting sentences and asking what they'd
expect, e.g.
"The dog is running": would you expect "run" to be a hit? "ran"? "the"? You can
work tokenizing in here, perhaps under the guise of "what's important when
searching?" Maybe even before the game above if you decide to do that.

Why or why not? Perhaps ask/talk about how a really stupid computer program
is supposed to figure stuff like this out.

Back up and tell them what a document is. How hard that is to define. Chris
M. is right on when he talks about hooking what they're interested in.

Maybe come up with some examples of really surprising results from searches,
and do a really *simple* explanation of how it got that way.

If you decide to go into scoring, stick with simplicity. Like "the more
times a word appears in a document, the more relevant it is". Can you even
guarantee that they'd understand phrasing this in terms of percentages?
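
To make the counts-versus-percentages point concrete, here is a small Python sketch. The two documents are invented for the example, and the scoring is plain counting, not the formula Lucene actually uses.

```python
from collections import Counter

def term_frequency(term, text):
    """Count how many times `term` occurs among the whitespace-split words."""
    return Counter(text.lower().split())[term]

def relative_frequency(term, text):
    """Share of the document's words that are `term` (0.0 for an empty document)."""
    words = text.lower().split()
    return words.count(term) / len(words) if words else 0.0

docs = {
    "short": "dog bites dog",
    "long": "the dog ran and the other dog ran after the first dog",
}
for name, text in docs.items():
    print(name, term_frequency("dog", text), round(relative_frequency("dog", text), 2))
```

By raw count the long document wins (3 occurrences against 2), but as a share of the words the short one does (2 of 3 words against 3 of 12), which is a nice discussion starter about what "more relevant" should mean.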

FWIW
Erick

Re: wikipedia and teaching kids search engines

Posted by Jon Baer <jo...@gmail.com>.
Just throwing this out there ... I recently saw something I found pretty interesting from CMU ...

http://csunplugged.org/activities

The search algorithm exercise was focused on a Battleship lookup I think.  

- Jon 


Re: wikipedia and teaching kids search engines

Posted by Chris Hostetter <ho...@fucit.org>.
: My goal is to index wikipedia in order to demonstrate search to a class of
: middle school kids that I've volunteered to teach for a couple of hours.
: Which brings me to my next question...

Twitter data is a little easier to ingest than the Wikipedia markup 
(the JSON-based streaming API gives you each tweet on its own line in a 
way that's really trivial to convert into CSV with a Perl script) and 
might seem more interesting to kids than Wikipedia, while still having 
some interesting metadata (user, post date, hash tags) and lexical 
challenges (synonyms, abbreviations, @ and # markup, etc.).
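
A rough sketch of that conversion, written in Python rather than Perl. The JSON field names used here (`user.screen_name`, `created_at`, `text`) and the choice of CSV columns are assumptions for illustration; check them against the data you actually receive.

```python
import csv
import io
import json
import re

def tweets_to_csv(lines, out):
    """Write one CSV row per JSON line: user, post date, hashtags, text."""
    writer = csv.writer(out)
    writer.writerow(["user", "created_at", "hashtags", "text"])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        text = tweet.get("text", "")
        # Pull hashtags out of the text itself (assumed good enough for a demo).
        hashtags = " ".join(re.findall(r"#(\w+)", text))
        user = tweet.get("user", {}).get("screen_name", "")
        writer.writerow([user, tweet.get("created_at", ""), hashtags, text])

# One made-up record in the one-JSON-object-per-line layout described above.
sample = ['{"user": {"screen_name": "erik"}, "created_at": "2010-03-24", '
          '"text": "teaching #search with #solr"}']
buffer = io.StringIO()
tweets_to_csv(sample, buffer)
print(buffer.getvalue())
```

In practice `lines` could be an open file or `sys.stdin`, and `out` a file opened with `newline=""`.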


: One idea I have is to bring some actual "documents", say a poster board with a
: sentence written largely on it, have the students physically *tokenize* the
: document by cutting it up and lexicographically building the term dictionary.
: Thoughts on taking it further welcome!

cutting up a paper document is a great way to teach textual analysis, but 
i think the real key is having two copies of multiple documents (3 would 
be enough) ... cut up one copy of each doc to build the term dictionary 
and tape all of those to the wall; then tape the second copy of every doc 
on the wall around them and draw lines from each term to the documents 
it's in (using a different color for the paper of each doc 
would be an easy way to spot which term is in which doc, and what the term 
frequency is).
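
The wall of terms with lines drawn to documents is an inverted index. A minimal Python sketch of the same structure follows; the three documents, named after paper colors, are made up, and the whitespace tokenizing is an assumption kept simple on purpose.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    "red": "the dog chased the cat",
    "blue": "the cat slept",
    "green": "a dog barked",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])  # each entry is one term and its "lines" to docs
```

Each dictionary entry plays the role of one slip of paper on the wall: the keys of its value are the documents the lines lead to, and the numbers are how many slips of that color carry the term.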


-Hoss


Re: wikipedia and teaching kids search engines

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 24, 2010, at 1:53 PM, Andrzej Bialecki wrote:

> On 2010-03-24 16:15, Markus Jelsma wrote:
>> A bit off-topic, but how about having Nutch grab some content and index it
>> in Solr?
> 
> The problem is not with collecting and submitting the documents, the problem is with parsing the Wikimedia markup embedded in XML. WikipediaTokenizer from Lucene contrib/ is a quick and perhaps acceptable solution ...

Yeah, the WikipediaTokenizer does a pretty decent job, but still has a few bugs that need fixing.  It handles most of the syntax and can also assign a type to each token.  It can also "roll up" tokens for things like categories into a single token (for things like faceting).

-Grant

Re: wikipedia and teaching kids search engines

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-03-24 16:15, Markus Jelsma wrote:
> A bit off-topic, but how about having Nutch grab some content and index it
> in Solr?

The problem is not with collecting and submitting the documents, the 
problem is with parsing the Wikimedia markup embedded in XML. 
WikipediaTokenizer from Lucene contrib/ is a quick and perhaps 
acceptable solution ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: wikipedia and teaching kids search engines

Posted by Markus Jelsma <ma...@buyways.nl>.
A bit off-topic, but how about having Nutch grab some content and index it 
in Solr?

On Wednesday 24 March 2010 16:08:43 Christopher Laux wrote:
> Hi Erik,
> 
> I'm working on Wikipedia search and use Solr. AFAIK it can't easily be
> done. The Wikipedia XML dump only provides the page title and author
> in terms of data one would search for. The rest requires parsing the
> MediaWiki markup, for which there is no good parser freely available
> (still writing my own). If you are happy with individual pages you
> could go with the HTML parser.
> 
> For the second part of your question, why don't you let them try
> competing tokenization strategies (with and w/o stemming etc.) and
> compare?
> 
> -Chris
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: wikipedia and teaching kids search engines

Posted by Christopher Laux <ct...@googlemail.com>.
Hi Erik,

I'm working on Wikipedia search and use Solr. AFAIK it can't easily be
done. The Wikipedia XML dump only provides the page title and author
in terms of data one would search for. The rest requires parsing the
MediaWiki markup, for which there is no good parser freely available
(still writing my own). If you are happy with individual pages you
could go with the HTML parser.
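
To illustrate the easy half of the problem, here is a sketch that streams a bz2-compressed dump and yields each page's title together with its raw, still unparsed markup. It assumes the usual export layout of `<page>` elements containing `<title>` and `<text>`; `iter_pages` and `local_name` are hypothetical helper names, and turning the markup into clean text remains the hard part described above.

```python
import bz2
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(path):
    """Yield (title, raw wiki markup) for each <page> in a bz2-compressed dump."""
    with bz2.open(path, "rb") as stream:
        title, text = None, ""
        # Only "end" events are needed: an element is complete when it closes.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # drop the finished page's children to limit memory
```

A caller would loop with `for title, markup in iter_pages("dump.xml.bz2"): ...` (the file name is a placeholder) and hand each pair to whatever indexing step follows.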

For the second part of your question, why don't you let them try
competing tokenization strategies (with and w/o stemming etc.) and
compare?

-Chris


Re: wikipedia and teaching kids search engines

Posted by Mark Miller <ma...@gmail.com>.
For what it's worth, this is what I use. It's probably one of the fastest 
methods out there.

It uses embedded Solr and multiple threads to process either an expanded 
wiki dump, or a bz2 compressed dump.

Simply apply the following patch to Solr trunk: 
http://pastebin.com/raw.php?i=Q5PR261W

And add commons-compress jar to solr/lib: 
http://mirrors.axint.net/apache/commons/compress/binaries/commons-compress-1.0-bin.zip

Then run with ant by specifying the wikidump (like what you can get 
here: http://download.wikimedia.org/enwiki/20100312/)

ant wikipedia -Dwiki-file=/home/mark/wikidumps/enwiki-latest-pages-articles.xml.bz2

Other properties you can pass:

-Dnum-docs=300 : defaults to 10000 - use max integer (or just something 
really high) to process the whole file
-Dnum-threads=2 : defaults to number of processors/cores - 1
-Dsolr.home={solrhomepath} : defaults to example/solr


This processes the wiki dump in the same manner as the Lucene benchmark 
contrib, so the fields are not super deep: text, title, date, and one or 
two others, I think. More could be added, though I don't think 
anything else is easy pickings.

-- 
- Mark

http://www.lucidimagination.com



Re: wikipedia and teaching kids search engines

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Erik,

One thing to think about (and I'm no expert at middle school kids) would be
to relate search somehow to a topic they are interested in. My 12 year old
nephew loves the NBA, so if I were to talk to him about search, I would try
and relate it to e.g., NBA.com, or understanding the difference between Kobe
(beef) say, and Kobe Bryant. Or trying to explain relevance in the context
of looking at Cars (the movie) versus looking for Cars (automobiles).

As for interactivity, cutting up the document is a great idea. You may
also want to make a handout with some exercises (I don't want to call them
"problems") that let the kids apply the fundamentals you cover in the
cutting exercise, and then work out why (and, most importantly, how) a
search engine can begin to figure out whether you were looking for a Kobe
steak or Kobe the NBA star.

Just my 2 cents...

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++