Posted to solr-user@lucene.apache.org by xn...@gmx.net on 2007/05/12 16:53:31 UTC

missing post.jar

hi,

i was trying out the introductory solr tutorial
and was unable to locate the post.jar mentioned there,
which is supposed to be used for posting files to the index.
i used post.sh inside cygwin instead,
but i would still like to know where to find this util (post.jar).

thanks

matej

Re: solr for corpus?

Posted by Huib Verweij <so...@cv2h.com>.
Hi matej,

since I didn't see anyone answering your question yet, I'll have a go at 
it. I'm not one of the Solr developers; I've just used it so far and 
am very happy with it. I use it for searching literary texts, storing 
information from a SQL database in the Solr documents as metadata for 
the texts.


xnrn@gmx.net wrote:
> i am testing solr as one of the potential tools for the purpose of 
> building a linguistic corpus.
>
> i'd like to have your opinion on the extent to which it would be a good choice.
>
> the specifics that i find deviate from the typical use of solr are:
>
> 1. basic unit is a text (of a book, of a newspaper-article, etc.), 
> with a bibliographic header.
> looking at the examples of the solr tutorial and the central concept 
> of the "field",
> i am a bit confused about how to map these onto one another, ie would the 
> whole text be one field "text"
> and the bibheader-items individual fields?
Yes, you could do that. What I did was: add the text as a whole in one 
field, add each chapter in its own field, and add metadata fields from a 
SQL database for each title (e.g. year=1966, author.name=Some one, 
author.placeofbirth=Somewhere). Basically, everything you want to 
explicitly search for/in goes in a separate field.
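
To make that concrete, here is a minimal sketch of what indexing one
title could look like with the SolrJ client (which postdates this
thread; in 2007 you would post an XML document instead). The core name,
URL and field names are illustrative, not a fixed schema:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexTitle {
        public static void main(String[] args) throws Exception {
            // Hypothetical core name and URL; adjust for your setup.
            SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/corpus").build();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "item-0001");
            doc.addField("text", "full text of the book ...");   // the whole text in one field
            doc.addField("chapter", "text of chapter one ...");  // multivalued field:
            doc.addField("chapter", "text of chapter two ...");  //   one value per chapter
            doc.addField("year", 1966);                          // metadata from the SQL database
            doc.addField("author.name", "Some one");
            doc.addField("author.placeofbirth", "Somewhere");

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }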
>
> 2. real full-text (no stop-words, really every word is indexed, 
> possibly even all the punctuation)
Shouldn't be a problem I think.
>
> 3. Lemma and PoS-tag (Part-of-Speech tag), or generally additional keys 
> on the word level,
> ie for every word we also have its lemma value and its PoS.
> In dedicated systems, this is implemented either as a vertical format 
> (each word on one line):
> word   lemma   pos
> ...
> or in newer systems with xml attributes:
> <w pos="Noun" lemma="tree">trees</w>
>
> Importantly, it has to be possible to mix these various layers in 
> one query, eg:
> "word(some) lemma(nice) pos(Noun)"
>
> This seems to me to be the biggest challenge for solr.
I'm not 100% sure what you are trying to do here, sorry.
>
> 4. indexing/searching ratio
> the corpus is very static: the selection of texts changes perhaps once a 
> year (in a production environment),
> so it doesn't really matter how long the indexing takes. Regarding 
> speed, the emphasis is on the searches,
> which have to be "fast" and exact, and the results have to be further 
> processable (kwic-view
possible (though it cuts off searching the text for keywords after 50KB. 
Actually, Lucene does that and it is configurable; in Solr this should 
correspond to the highlighter's hl.maxAnalyzedChars parameter, whose 
default is 51200 characters. It can be annoying, so you might have to 
tweak that if you find that Solr doesn't return a kwic view for a hit. 
But maybe I'm not using Solr the right way ;-). )
> , thinning the solution,
possible
> sorting,
possible
> export,
Not sure what you mean here, but Solr just returns an XML document that 
you can process any way you like.
> etc.). "Fast" is important also for more complex queries (ranges, 
> boolean operators and prefixes mixed)
> and i say 10-15 seconds is the upper limit, which should be rather an 
> exception to the rule of ~ 1 second.
>
> 5. also regarding the size: we are talking about multiples of 100 
> million tokens.
> the canonical example, the British National Corpus, is 100 million; 
> there are corpora with 2 billion tokens
That's a lot of text. I find Solr performs very well, but I can't 
guarantee you that Solr will work in your case; other more knowledgeable 
people might be able to, though.

Good luck with your decision making!

Kind regards,

Huib Verweij.


RE: solr for corpus?

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
Regarding the Lemma and PoS-tag requirement: you might handle this by
inserting each word as its own document, with "lemma", "pos", and "word"
fields, thereby allowing you lots of search flexibility. You could also
include ID fields for the item and (if necessary) part (chapter etc.)
and use these as facets, allowing you to group results by the items that
contain them. Your application would have to know how to use the item ID
value to retrieve the full item-level record.

These word-level records could live in a separate index or in the main
index (since there are no required fields in Solr, you can have entirely
different record structures in a single index; you just have to
structure your queries accordingly). The problem will be that because
your word-level entries are separate from your item-level entries,
you'll have to include in the word-level entries any item-level fields
that you want to be able to use in word-level queries (e.g. if you
wanted to be able to limit a lemma search by date).  
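
As a hedged sketch of this one-document-per-word idea (again using the
SolrJ client, which postdates this thread, and made-up field names such
as word, lemma, pos and item_id), indexing the token from the earlier
<w pos="Noun" lemma="tree">trees</w> example might look roughly like this:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWordLevel {
        public static void main(String[] args) throws Exception {
            SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/corpus").build();

            // One document per running word of the corpus.
            SolrInputDocument w = new SolrInputDocument();
            w.addField("id", "item-0001_w42");   // item id plus token position
            w.addField("word", "trees");
            w.addField("lemma", "tree");
            w.addField("pos", "Noun");
            w.addField("item_id", "item-0001");  // facet on this to group hits by item
            w.addField("year", 1966);            // item-level field copied down, so a
                                                 // lemma search can be limited by date
            solr.add(w);
            solr.commit();
            solr.close();
        }
    }

A request along the lines of q=lemma:nice&fq=year:[1960 TO 1970]
&facet=true&facet.field=item_id would then find word-level hits and
group them by the containing item.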

The alternative would be to insert the lemma/pos/word entries in a
multivalued string field and come up with more complex wildcard query
structures to get at them. Apparently you can now get queries with
leading and trailing wildcards to work, so you should be able to do
everything you need, but I don't know how the performance will be.
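
A compact, hedged sketch of that encoding (the multivalued field name
token and the pipe separator are made up for the example; constraints
across word positions would need more machinery than this shows):

    import org.apache.solr.common.SolrInputDocument;

    public class EncodedTokenField {
        public static void main(String[] args) {
            // Hypothetical multivalued string field "token", one encoded
            // word|lemma|pos entry per running word of the item.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "item-0001");
            doc.addField("token", "some|some|Det");
            doc.addField("token", "nice|nice|Adj");
            doc.addField("token", "trees|tree|Noun");
            // A lemma search is then a wildcard query on the middle part,
            // e.g. q=token:*|nice|*; performance would need testing.
            System.out.println(doc);
        }
    }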

All the best,

Peter

-----Original Message-----
From: xnrn@gmx.net [mailto:xnrn@gmx.net] 
Sent: Saturday, May 12, 2007 11:28 AM
To: solr-user@lucene.apache.org
Subject: solr for corpus?

i am testing solr as one of the potential tools for the purpose of
building a linguistic corpus.

i'd like to have your opinion on the extent to which it would be a good
choice.

the specifics that i find deviate from the typical use of solr are:

1. basic unit is a text (of a book, of a newspaper-article, etc.), with
a bibliographic header. looking at the examples of the solr tutorial and
the central concept of the "field", i am a bit confused about how to map
these onto one another, ie would the whole text be one field "text"
and the bibheader-items individual fields?

2. real full-text (no stop-words, really every word is indexed, possibly
even all the punctuation)

3. Lemma and PoS-tag (Part-of-Speech tag), or generally additional keys
on the word level, ie for every word we also have its lemma value and
its PoS.
In dedicated systems, this is implemented either as a vertical format
(each word on one line):
word   lemma   pos
...
or in newer systems with xml attributes:
<w pos="Noun" lemma="tree">trees</w>

Importantly, it has to be possible to mix these various layers in
one query, eg:
"word(some) lemma(nice) pos(Noun)"

This seems to me to be the biggest challenge for solr.

4. indexing/searching ratio
the corpus is very static: the selection of texts changes perhaps once a
year (in a production environment), so it doesn't really matter how long
the indexing takes. Regarding speed, the emphasis is on the searches,
which have to be "fast" and exact, and the results have to be further
processable (kwic-view, thinning the solution, sorting, export, etc.).
"Fast" is important also for more complex queries (ranges, boolean
operators and prefixes mixed), and i'd say 10-15 seconds is the upper
limit, which should rather be an exception to the rule of ~1 second.

5. also regarding the size: we are talking about multiples of 100
million tokens.
the canonical example, the British National Corpus, is 100 million;
there are corpora with 2 billion tokens.



thank you in advance

regards
matej

Re: solr for corpus?

Posted by Chris Hostetter <ho...@fucit.org>.
: 3. Lemma and PoS-tag (Part-of-Speech tag), or generally additional keys
: on the word level,
: ie for every word we also have its lemma value and its PoS.

: or in newer systems with xml attributes:
: <w pos="Noun" lemma="tree">trees</w>
:
: Importantly, it has to be possible to mix these various layers in
: one query, eg:
: "word(some) lemma(nice) pos(Noun)"

the best way to approach this would probably be to preprocess the data and
use a custom analyzer ... send it to solr with all of the info encoded in
each word (ie: trees__tree_Noun) and then have a custom indexing analyzer
create multiple tokens in each position with an easy way to distinguish
whether a token is a word, the Lemma for a word, or the POS for a word (ie:
the regular word plain, the Lemma prefixed by two underscores, and the POS
prefixed by a single underscore). then at query time if you know you are
looking for the phrase "some nice trees" you would search for "some nice
trees", but if you are looking for the word "some" followed by a word whose
lemma is "nice" followed by any Noun, you would search for "some __nice _Noun"

: This seems to me to be the biggest challenge for solr.

yeah ... neither Solr nor Lucene really attempts to tackle complex query
forms like this ... but Lucene has recently added a Token Payload
mechanism in an attempt to make queries like this easier (allowing
annotation of the actual terms that can be queried, instead of needing to
create artificial terms in identical positions)

: the corpus is very static: the selection of texts changes perhaps once a
: year (in a production environment),
: so it doesn't really matter how long the indexing takes. Regarding
: speed, the emphasis is on the searches,
: which have to be "fast" and exact, and the results have to be further
: processable (kwic-view, thinning the solution, sorting, export, etc.).
: "Fast" is important also for more complex queries (ranges, boolean
: operators and prefixes mixed)

these things should all be decent, especially since your index will be
fairly static so you don't have to worry about 'warming' FieldCaches for
sorting etc.... something you might want to consider if you find query
speeds unacceptable on your full corpus with stop words left in would be
to sacrifice disk for speed by creating another field where the stop words
are removed and using it as much as possible (ie: anytime a query doesn't
care about stop words) ... but i wouldn't worry about that unless you
find it's actually a problem.  i've yet to see a complaint from anyone
that Solr isn't fast enough unless they are doing heavy faceting, or
updating their index so frequently that the caches can't be used.


-Hoss


solr for corpus?

Posted by xn...@gmx.net.
i am testing solr as one of the potential tools for the purpose of 
building a linguistic corpus.

i'd like to have your opinion on the extent to which it would be a good choice.

the specifics that i find deviate from the typical use of solr are:

1. basic unit is a text (of a book, of a newspaper-article, etc.), with 
a bibliographic header.
looking at the examples of the solr tutorial and the central concept of 
the "field",
i am a bit confused about how to map these onto one another, ie would the 
whole text be one field "text"
and the bibheader-items individual fields?

2. real full-text (no stop-words, really every word is indexed, possibly 
even all the punctuation)

3. Lemma and PoS-tag (Part-of-Speech tag), or generally additional keys 
on the word level,
ie for every word we also have its lemma value and its PoS.
In dedicated systems, this is implemented either as a vertical format 
(each word on one line):
word   lemma   pos
...
or in newer systems with xml attributes:
<w pos="Noun" lemma="tree">trees</w>

Importantly, it has to be possible to mix these various layers in 
one query, eg:
"word(some) lemma(nice) pos(Noun)"

This seems to me to be the biggest challenge for solr.

4. indexing/searching ratio
the corpus is very static: the selection of texts changes perhaps once a 
year (in a production environment),
so it doesn't really matter how long the indexing takes. Regarding 
speed, the emphasis is on the searches,
which have to be "fast" and exact, and the results have to be further 
processable (kwic-view, thinning the solution, sorting, export, etc.). 
"Fast" is important also for more complex queries (ranges, boolean 
operators and prefixes mixed),
and i'd say 10-15 seconds is the upper limit, which should rather be an 
exception to the rule of ~1 second.

5. also regarding the size: we are talking about multiples of 100 
million tokens.
the canonical example, the British National Corpus, is 100 million; 
there are corpora with 2 billion tokens.



thank you in advance

regards
matej

Re: missing post.jar

Posted by Brian Whitman <br...@variogr.am>.
> i was trying out the introductory solr tutorial
> and was unable to locate the post.jar mentioned there,
> which is supposed to be used for posting files to the index.
> i used post.sh inside cygwin instead,
> but i would still like to know where to find this util (post.jar).

If you downloaded the latest release it's not in there. I suggest
downloading from svn or a nightly build:
http://people.apache.org/builds/lucene/solr/nightly/

In the nightly builds post.jar is in example/exampledocs/post.jar
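
For what it's worth: once you have a nightly build, the tutorial posts
the example documents from the example/exampledocs directory with a
command along the lines of "java -jar post.jar *.xml" (the exact file
arguments depend on which example documents you want to post).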