Posted to solr-user@lucene.apache.org by Eric Martin <er...@makethembite.com> on 2010/10/30 21:34:48 UTC

Basic Document Question

Hi everyone,

 

I'm new, which won't be hard to figure out after I ask this question:

 

I use Drupal/Solr/Nutch

 

http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup

 

Solr specific:

How do I re-index for specific content only? I am starting a legal index
specifically geared for law students and lawyers. I am crawling law-related
sites but I really don't want to index law firms, just the law content on
places like:

http://www.ecasebriefs.com/blog/law/

http://www.lawnix.com/cases/cases-index/

http://www.oyez.org/

http://www.4lawnotes.com/

http://www.docstoc.com/documents/education/law-school/case-briefs

http://www.lawschoolcasebriefs.com/

http://dictionary.findlaw.com/

 

As I was saying, while crawling I get all kinds of extraneous information put
into the Solr index. How do I combat that?

 

I am assuming (cough) that I can do this, but I am really at a loss as to
where to start looking to get this done. I prefer to learn and I definitely
don't want to waste anyone's time.

 

Non-Solr Specific

Does anyone here help with Nutch, or is this Solr only?

 

I am sorry if I am asking elementary questions or asking in the wrong
place. I just need to be pointed to the right place. I'm sort of
lost (imagine that).

 

Thanks

 

Eric

 

 

 


Re: Basic Document Question

Posted by Erick Erickson <er...@gmail.com>.
I guess that depends on what you mean by re-index, but here are some
guesses. All of them share the assumption that you can determine *what* you
want to index from the various sites. That is, you have some way of
identifying the content you care about.

Solr won't help you at all in identifying what you really want; it just
follows the orders you give it when you tell it to index content.

> If you already have junk in your Solr index that you want to remove, you
can delete by query (and risk removing valuable stuff if the query matches
more than you intended). You could also reindex from scratch.

> *Assuming* you have a unique key defined, and you're really asking about
updating documents, you don't have to do anything special. If your
schema.xml file has a <uniqueKey> identifying a particular field, just add
your document again and Solr will automatically delete the old version and
add the new one.
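
To make the two options above concrete, here is a rough Python sketch of the
XML messages involved. It is only an illustration, not the one true way to do
it. It assumes Solr is running at the example's default address
(http://localhost:8983/solr/update) and that the <uniqueKey> field is called
"id"; the helper names and the other field names ("host", "title") are made
up for the example, so substitute whatever your schema.xml actually defines.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

# Assumption: a local Solr at the example's default address.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_delete_by_query(query):
    """XML message that deletes every document matching `query`."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


def build_add(doc):
    """XML message that adds one document given as a field-name -> value dict."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        ET.SubElement(doc_el, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_update(xml_body, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr and return the HTTP status code."""
    request = Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urlopen(request) as response:
        return response.status


def remove_junk(query, url=SOLR_UPDATE_URL):
    """Delete by query, then commit so the deletion becomes visible."""
    post_update(build_delete_by_query(query), url)
    post_update("<commit/>", url)


def reindex_document(doc, url=SOLR_UPDATE_URL):
    """Re-add a document. With a <uniqueKey> defined, a document whose key
    matches an existing one replaces the old version."""
    post_update(build_add(doc), url)
    post_update("<commit/>", url)
```

For instance, remove_junk("host:www.some-law-firm.example") would drop
everything indexed from that (hypothetical) host, provided your schema has a
"host" field. Run the same query as a normal search first to see what it
matches, since a delete by query removes every hit.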

If none of this makes sense, perhaps you can give us a better idea of what
updating means in your use case...

This forum concentrates on Solr; there's a Nutch forum that'll help you
there, and I haven't a clue about Drupal.

Best
Erick
