Posted to common-user@hadoop.apache.org by Edward Capriolo <ed...@gmail.com> on 2010/01/11 20:52:13 UTC

Bible Code and some input format ideas

Hey all,
I saw a special on Discovery about the Bible code.
http://en.wikipedia.org/wiki/Bible_code

I am designing something in Hadoop to do Bible-code analysis on any text (not
just the Bible). I have a rough idea of how to make all the parts
efficient in map/reduce. I have a little challenge I originally
thought I could solve with a custom InputFormat, but it seems I
may have to do this in a standalone program.

Let's assume your input looks like this:

Is there any
bible-code in this
text? I don't know.

The end result might look like this (assuming I take every 5th letter):

irbcn
tdn__

The first part of the process: given an input text, we have to strip
out a user-configured list of characters: '\t', '-', '.', '?'. That
part I have no problem with.
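
For what it is worth, the cleaning step is just a character filter. A
minimal sketch (the class name and strip list are placeholders, not
code from my project); it also lower-cases, which matches the example
above:

import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the cleaning step: drop every character that
// appears in a user-configured strip list and lower-case the rest.
public class TextCleaner {
    private final Set<Character> strip = new HashSet<Character>();

    // charsToStrip is the user-configured list, e.g. " \t-.?'"
    public TextCleaner(String charsToStrip) {
        for (char c : charsToStrip.toCharArray()) {
            strip.add(c);
        }
    }

    public String clean(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (char c : line.toCharArray()) {
            if (!strip.contains(c)) {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }
}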

For the second part of the process, I would like to get the data to
the proper width, in this case 5 characters. This is a challenge
because a line that is 5 characters, e.g. 'done?', will be only 4
characters, 'done', once it is cleaned. That -1 offset changes the
rest of the data, the next line might have another offset, and so on.

Originally I was thinking I could create an NCharacterInputFormat, but
it seems like this stage of the process cannot easily be done in
map/reduce. I guess I need to write a single-threaded program to read
through the data and write out lines of the correct width (5 characters
per line). Unless someone else has an idea.
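
For the single-threaded fallback, what I have in mind is basically this
(just a sketch; the width and the plain input/output files are
hard-coded for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch of the single-threaded pass: read the already-cleaned text
// and re-wrap it into fixed-width lines (here 5 characters per line).
public class Reflow {
    public static void main(String[] args) throws IOException {
        final int width = 5;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = new PrintWriter(new FileWriter(args[1]));
        StringBuilder buf = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            buf.append(line);
            while (buf.length() >= width) {
                out.println(buf.substring(0, width));
                buf.delete(0, width);
            }
        }
        if (buf.length() > 0) {
            out.println(buf.toString()); // last, possibly short, line
        }
        in.close();
        out.close();
    }
}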

Re: Bible Code and some input format ideas

Posted by Stephen Watt <sw...@us.ibm.com>.
Neat!  Please keep the list apprised when you have something to demo.

Kind regards
Steve Watt




Re: Bible Code and some input format ideas

Posted by Edward Capriolo <ed...@gmail.com>.

All,

I wanted to share some of the progress I have made on this problem.
First, I have a distributed program that generates "skip grids".
Then I have two programs:
1) one searches a grid and applies word-count logic
2) one searches a grid and returns all locations of a word (a rough
sketch of the idea is below)
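
To give a feel for what the location search does, here is a
stripped-down illustration that only checks straight down and straight
up each column of a grid. It is just a sketch of the idea, not the code
from the repo:

import java.util.ArrayList;
import java.util.List;

// Illustration only: find every place a word appears reading straight
// down (dir = +1) or straight up (dir = -1) a column of a grid.
public class GridWordLocator {

    // Each hit is reported as {startRow, startCol, endRow, endCol}.
    public static List<int[]> locate(String[] grid, String word) {
        List<int[]> hits = new ArrayList<int[]>();
        int width = grid[0].length();
        for (int col = 0; col < width; col++) {
            for (int row = 0; row < grid.length; row++) {
                if (matches(grid, word, row, col, +1)) {
                    hits.add(new int[] { row, col, row + word.length() - 1, col });
                }
                if (matches(grid, word, row, col, -1)) {
                    hits.add(new int[] { row, col, row - word.length() + 1, col });
                }
            }
        }
        return hits;
    }

    // Walk from (row, col) in the given direction, comparing letters.
    private static boolean matches(String[] grid, String word,
                                   int row, int col, int dir) {
        for (int i = 0; i < word.length(); i++) {
            int r = row + i * dir;
            if (r < 0 || r >= grid.length || col >= grid[r].length()) {
                return false;
            }
            if (grid[r].charAt(col) != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}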

All the code is available here:
http://www.jointhegrid.com/svn/hadoop_bible_code/

So here is a sample of text:
http://www.cs.princeton.edu/introcs/data/bible.txt

After some searching:

[ecapriolo@laptop1 bc]$ more wordlocations/part-00000
outgrids/2/1,edward     ,69593 1 69588 1
outgrids/3/0,hadoop     ,164154 1 164149 1
outgrids/4/0,java       ,11400 3 11403 0
outgrids/4/1,java       ,73400 0 73403 3

Trick to find location in original file:
vi +1477386 +"/[hadoop]" outgrids/1/0

Hear the voice of my supplications, when I cry unto thee, when I lift
up my hands toward thy holy oracle. Draw me not away with the wicked,
and with the workers of iniquity, which speak peace to their
neighbours, but mischief is in their hearts.

See anything? Try skipping by 3 :)

cuo
ehi
fph
dor
hoo
cdw
nay
thi
eni
twk
siq
thh
epc

See anything?

p        o        o        d        a        h
pmyhandstowardthyholyoracledrawmenotawaywiththewicked

:)

Re: Bible Code and some input format ideas

Posted by Edward Capriolo <ed...@gmail.com>.

Alan,

You could split by book, but there is no standard way to do it really.
I spoke to someone off-list who told me a single-threaded program was
able to make skips from 2 to 3000, starting at each different offset of
Moby Dick, in 12 minutes. Given that a text copy of the Bible is about
4MB, your conclusion

> And in most cases the divisions in
> the text will be small enough to be handled by a single mapper (though not
> necessarily well balanced).

is correct. I am thinking of having each map/reduce process work with
the entire text and just one skip value:

mapper1 bible.txt skip2 all permutations
mapper2 bible.txt skip3 all permutations
mapper3 bible.txt skip4 all permutations
mapperN bible.txt skipN all permutations

So a map process might collect something like:

"bible.txt", 2 ,   [grid as bytes] [grid as bytes]
"bible.txt", 3 ,   [grid as bytes] [grid as bytes] [grid as bytes]

Since I have been told that making the permutations takes 12 minutes
single-threaded, I am not going to overthink it :) Hopefully doing the
word search on the grids will give Hadoop more of a workout :)

Re: Bible Code and some input format ideas

Posted by Alan Gates <ga...@yahoo-inc.com>.
I'm guessing that you want to set the width of the text to avoid the
issue where, if you split by block, all splits but the first will have
an unknown offset.

Most texts have natural divisions in them which I'm guessing you'll
want to respect anyway.  In the Bible this would be the different
books; in more recent books it would be different chapters.  Could you
instead set up your InputFormat to split on these divisions in the
text?  Then you don't have to go through this single-threaded step.
And in most cases the divisions in the text will be small enough to be
handled by a single mapper (though not necessarily well balanced).
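
The piece such an InputFormat would need is the set of byte offsets
where the divisions start. A rough sketch of that scan (the heading
test is just a placeholder and would depend on the text):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: record the byte offset of each natural division (lines
// that look like a book/chapter heading). A custom InputFormat could
// then turn consecutive offsets into one split per division.
// Assumes plain ASCII text with \n line endings.
public class DivisionOffsets {
    public static List<Long> find(String file, String headingPrefix)
            throws IOException {
        List<Long> offsets = new ArrayList<Long>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        long pos = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(headingPrefix)) {
                offsets.add(pos);
            }
            pos += line.length() + 1; // +1 for the newline
        }
        in.close();
        return offsets;
    }
}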

Alan.
