Posted to common-user@hadoop.apache.org by David Milne <d....@gmail.com> on 2010/06/13 02:24:16 UTC

Re: How to implement this with hadoop, guidelines PLEASE, hadoop beginner

I doubt anyone is going to do your homework for you. The point of
these projects is not for the project to get done; it's for you to
learn how to do it, and to learn how to discover things for
yourself.

Just start with the tutorials like everyone else. What you want to do
is extremely close to the 2nd WordCount tutorial described here:

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
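To connect it to that tutorial: your job would essentially be a map-only
variant of WordCount, where map() emits a line unchanged whenever it should
be kept, and no reduce step is needed. Stripped of the Hadoop plumbing (the
class and method names below are made up for illustration; real Hadoop code
would emit via an OutputCollector and load the term set during setup), the
core of the map logic is just:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterMapSketch {
    // What one map() call would do for one input line: emit the line
    // unchanged if its third word is one of the frequent terms, otherwise
    // emit nothing.
    static void map(String line, Set<String> terms, List<String> output) {
        String[] words = line.trim().split("\\s+");
        if (words.length >= 3 && terms.contains(words[2])) {
            output.add(line);
        }
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<String>(Arrays.asList("apple"));
        List<String> output = new ArrayList<String>();
        for (String line : Arrays.asList("one two apple", "one two pear")) {
            map(line, terms, output);
        }
        // Only the line whose third word is "apple" survives.
        System.out.println(output);
    }
}
```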

Also, just sit down for a bit and think about efficiency. Is using
Hadoop part of the assignment, or do you want to use it just because
your current program is slow? Your current program could be made much
much faster.

At the moment, when deciding whether to print a line, you look at
every word in your presumably large term list (expensive), and search
for that word in the line using String.contains() (also expensive).
You could instead split the line into words, look at each word (there
would not be many, so this would be cheap) and then check whether your
term collection contains that word. That check would also be cheap if
you used the right data structure instead of a list (something that
supports constant-time lookup - you should know what this is if you
are studying computer science).
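As a rough sketch of the cheaper check (variable and method names here are
made up; HashSet is one such constant-time-lookup structure):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LineFilter {
    // Returns true if the line's third word appears in the term set.
    // HashSet.contains() is O(1) expected time, versus O(n) for scanning
    // a List with contains() on every line.
    static boolean keepLine(String line, Set<String> terms) {
        String[] words = line.trim().split("\\s+");
        return words.length >= 3 && terms.contains(words[2]);
    }

    public static void main(String[] args) {
        Set<String> terms =
            new HashSet<String>(Arrays.asList("hadoop", "lucene"));
        System.out.println(keepLine("foo bar hadoop baz", terms)); // true
        System.out.println(keepLine("foo bar qux", terms));        // false
    }
}
```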

Also, BufferedWriter is so named because it already buffers its
output. Just write directly to it, rather than accumulating everything
in your StringBuilder first. If you keep doing what you are doing, you
are likely to run out of memory on large files.
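Roughly like this (names are made up; the StringWriter just stands in for
whatever file writer you already open):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.Arrays;

public class DirectWrite {
    // Write each matching line straight to the BufferedWriter instead of
    // collecting everything in a StringBuilder first; the writer's own
    // internal buffer handles batching, so memory use stays constant
    // regardless of file size.
    static void writeMatches(Iterable<String> lines, BufferedWriter out)
            throws IOException {
        for (String line : lines) {
            out.write(line);
            out.newLine();
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter(); // stands in for a FileWriter
        writeMatches(Arrays.asList("line one", "line two"),
                     new BufferedWriter(sink));
        System.out.print(sink);
    }
}
```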

On Sat, Jun 12, 2010 at 6:44 PM, Xaida <ho...@gmail.com> wrote:
>
> Hi all!
>
> The dataset for my project turned out to be huge, and my teacher told me I
> have to use the Hadoop framework. I am struggling to understand how to do
> this, but honestly, I can't move past this dead point :( I am not that good
> a programmer and I cannot find any classmates who know Hadoop to help me
> out... I apologize in advance for writing a lot, but I really don't know
> what else to do.
>
> So I have this implemented with some Java concurrency features, but it is
> too slow for a dataset of this size:
>
> - The algorithm takes one folder and, in all its subfolders, finds .txt
> files with a specific name
> - It queries a Lucene index and populates a list of the most frequent terms
> - It parses the .txt files line by line and checks whether each line's
> third word matches any term in the list
> - If some list term matches the third word of a line, the entire line is
> stored in a buffer, and afterwards the buffers are written to output .txt
> files.
>
> So the final results are .txt files with the same structure as the
> original ones, except that they are smaller, since they contain only the
> matching lines.
>
> I am attaching two files:
> 1) TextFileAnalyzer.java, a Java Callable that takes a .txt file and the
> list and does the parsing and comparison.
> 2) MainAnalyzer.java, which goes through the main folder, collects the
> .txt files, and hands them to the TextFileAnalyzer callables, together
> with the list it gets from the Lucene index.
>
> I am sorry for asking for so much help, but I really have nobody to ask. I
> tried to grasp how to do this, but with this brain and this time, it's out
> of my reach.
>
> Also, I read that it is not possible to query a Lucene index on Hadoop -
> is that true?
>
> I will very much appreciate any help; it is very much needed.
> Thank you in advance!
> Aida
>
>
>
> http://lucene.472066.n3.nabble.com/file/n890309/TextFileAnalyzer.java
> TextFileAnalyzer.java
> http://lucene.472066.n3.nabble.com/file/n890309/MainAnalyzer.java
> MainAnalyzer.java
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-implement-this-with-hadoop-guidelines-PLEASE-hadoop-beginner-tp890309p890309.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>