Posted to common-user@hadoop.apache.org by Kevin Corby <kc...@pf-cvl.net> on 2007/12/21 16:50:42 UTC

Possible hadoop application

Hello,

I am just looking into Hadoop for a possible application and was hoping 
to get some feedback about whether it is a good fit and how to structure 
it. Basically my application works like this:
1. Documents arrive, maybe as part of a web crawl or something like that.
2. Documents are indexed for searching.
3. Documents have special fields extracted and stored; for instance, all
country names might be extracted as a COUNTRY field, dates as a DATE 
field, IP addresses as an IP field, etc.
4. Users run queries against the index to find matching documents.
5. Users run jobs that process some combination of the extracted field 
values and query terms for a (possibly large) number of documents to 
find patterns, relationships, etc.

An example of #5 might be:
Find all business-country relationships that exist in this set of 
document IDs where the previously extracted country name is within 20 
terms of a term matching a query of business names (not previously 
extracted or tagged):  (McDonalds OR "Burger King" OR "Taco Bell" OR 
"Wal Mart" ...)

The output would be something like:
McDonald's - Mexico => Documents 5, 76, 100
Wal Mart - Mexico => Documents 5, 22
Wal Mart - United States => Documents 22, 43, 100, 101
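
To make #5 concrete, here is a rough sketch of how I picture it as a
map/reduce job, in case that helps the discussion. Everything here is
guesswork on my part: I'm assuming the input is (document ID, document
text) pairs, and the BUSINESSES and COUNTRIES sets are hard-coded
stand-ins for the real business-name query match and the stored
COUNTRY extractions.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BusinessCountryJob {

  // Emits ("business - country", docId) whenever a business term and a
  // country term co-occur within 20 terms of each other.
  public static class PairMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    // Made-up stand-ins for the real query and the COUNTRY extractions.
    private static final Set<String> BUSINESSES =
        new HashSet<String>(Arrays.asList("mcdonalds", "walmart"));
    private static final Set<String> COUNTRIES =
        new HashSet<String>(Arrays.asList("mexico", "canada"));

    public void map(Text docId, Text docText,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] terms = docText.toString().toLowerCase().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        if (!BUSINESSES.contains(terms[i])) continue;
        // Look for a country within 20 terms on either side.
        int lo = Math.max(0, i - 20);
        int hi = Math.min(terms.length - 1, i + 20);
        for (int j = lo; j <= hi; j++) {
          if (COUNTRIES.contains(terms[j])) {
            out.collect(new Text(terms[i] + " - " + terms[j]), docId);
          }
        }
      }
    }
  }

  // Collects and deduplicates the document IDs seen for each pair.
  public static class PairReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text pair, Iterator<Text> docIds,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Set<String> docs = new TreeSet<String>();
      while (docIds.hasNext()) {
        docs.add(docIds.next().toString());
      }
      StringBuilder joined = new StringBuilder();
      for (String d : docs) {
        if (joined.length() > 0) joined.append(", ");
        joined.append(d);
      }
      out.collect(pair, new Text(joined.toString()));  // pair => doc list
    }
  }
}

The reduce output would then be lines like "walmart - mexico   5, 22",
which is essentially the report above.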

I work on an existing application that functions much like this. We 
are currently using Lucene for the search index, and it works fairly 
well, but it is difficult to scale #5 to a large number of users or 
documents while keeping it reasonably responsive.

It seems that Hadoop might be a nice fit for this in a few places:
1) Indexing
2) Extraction of field values
3) Running of jobs to process field values / query terms

I am especially interested in #3, but I'm not quite sure how it would 
work. How would the extracted values be stored for quick lookup by 
document ID and processing? Given that files in Hadoop's DFS are write-once, would I be 
forced to have many small files as new documents are added and 
processed, or can the new extractions be somehow combined with the old 
ones on the distributed file system?
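
From skimming the API docs, I wonder whether something like Hadoop's
MapFile (a sorted SequenceFile with an index that supports lookups by
key) would fit here, with a periodic job merging newly extracted values
into it. The read side might look roughly like the following -- the
"extractions" path and the idea of keying on document ID with the
fields packed into a Text value are pure guesses on my part:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ExtractionLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // "extractions" is a made-up path to a MapFile directory whose keys
    // are document IDs and whose values are the extracted field values.
    MapFile.Reader reader = new MapFile.Reader(fs, "extractions", conf);
    try {
      Text fields = new Text();
      if (reader.get(new Text("doc-100"), fields) != null) {
        System.out.println("doc-100 => " + fields);  // e.g. COUNTRY=Mexico
      }
    } finally {
      reader.close();
    }
  }
}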

And would it be possible to use Hadoop to dig the matching query terms 
out of the documents, since that can also be slow?

Thanks for any feedback.

- Kevin

Re: Possible hadoop application

Posted by Toby DiPasquale <co...@gmail.com>.
You might want to look at CouchDB for this. It is stronger on the  
query side of things right now and has a similar model.

--
Toby DiPasquale
Software Assassin

On Dec 21, 2007, at 10:50, Kevin Corby <kc...@pf-cvl.net> wrote:

> [original message snipped]

RE: Possible hadoop application

Posted by edward yoon <we...@udanax.org>.
>> ...Documents are indexed for searching.
>> query terms for ...

I assume an inverted index will be used for your data mining application.
In that case, I would recommend a survey of map/reduce (the Hadoop examples are great).
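
For example, the heart of an inverted index in map/reduce is only a few
lines. This is just a sketch: I assume (docId, text) input pairs and
simple whitespace tokenization.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {

  // Map phase: emit (term, docId) for every term in the document.
  public static class IndexMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text docId, Text docText,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      for (String term : docText.toString().toLowerCase().split("\\s+")) {
        out.collect(new Text(term), docId);
      }
    }
  }

  // Reduce phase: concatenate the posting list for each term.
  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text term, Iterator<Text> docIds,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      StringBuilder postings = new StringBuilder();
      while (docIds.hasNext()) {
        if (postings.length() > 0) postings.append(", ");
        postings.append(docIds.next().toString());
      }
      out.collect(term, new Text(postings.toString()));
    }
  }
}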

Further references: Data mining, Document classification/categorization, Social Network Analysis, etc.

------------------------------

B. Regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org


> Date: Fri, 21 Dec 2007 10:50:42 -0500
> From: kcorby@pf-cvl.net
> To: hadoop-user@lucene.apache.org
> Subject: Possible hadoop application
>
> [original message snipped]
