Posted to common-user@hadoop.apache.org by Cam Bazz <ca...@gmail.com> on 2008/06/27 00:02:56 UTC

edge count question

hello,

I have a Lucene index storing documents that hold src and dst words. Word
pairs may repeat (it is a multigraph).

I want to use Hadoop to count how many identical word pairs there are. I
have looked at the aggregate word count example, and I understand that if I
make a text file such as

src1>dst2
src2>dst2
src1>dst2

..

and use something similar to the aggregate word count example, I will get
the desired result.
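
As a rough illustration (nothing below is from the thread): a minimal pair-counting job in the spirit of the word count examples, written against the old org.apache.hadoop.mapred API. The class names and paths are made up, and one "src>dst" pair per input line is assumed.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class EdgeCount {

  // each input line is one "src>dst" edge; the whole line is used as the key
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      out.collect(line, ONE);
    }
  }

  // sums the 1s emitted for each distinct "src>dst" pair
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text pair, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(pair, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(EdgeCount.class);
    conf.setJobName("edge-count");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // local pre-aggregation of pair counts
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. /user/cam/edges
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. /user/cam/counts
    JobClient.runJob(conf);
  }
}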

Now the questions: how can I hook up my Lucene index to Hadoop? Is there a
better way than dumping the index to a text file with >'s, copying it to DFS,
and getting the results back?

How can I make incremental runs? (Once the index is processed and I have the
results, how can I add more data without starting from the beginning?)

Best regards,

-C.B.

Re: edge count question

Posted by Enis Soztutar <en...@gmail.com>.

Cam Bazz wrote:
> Hello,
>
> When I use a custom input format, as in the Nutch project, do I have to
> keep my index in DFS or on the regular file system?
>
>   
You have to ensure that your indexes are accessible to the map/reduce
tasks, i.e. by putting them on HDFS, S3, NFS, KFS, etc.
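
Purely as an illustration of that point (nothing here is from the thread, and all paths are made up): two common ways to make a local Lucene index reachable from the tasks are to copy it into HDFS, or to ship an archived copy of it through the distributed cache.

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ShipIndex {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ShipIndex.class);
    FileSystem fs = FileSystem.get(conf);

    // option 1: copy the local Lucene index directory into HDFS so every
    // map/reduce task can open it from there (paths are illustrative)
    fs.copyFromLocalFile(new Path("/local/lucene-index"),
                         new Path("/user/cam/lucene-index"));

    // option 2: ship a zipped copy of the index that already sits in HDFS
    // to every task node via the distributed cache
    DistributedCache.addCacheArchive(new URI("/user/cam/lucene-index.zip"), conf);
  }
}
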
> By the way, are there any alternatives to nutch?
>   
Yes, of course. There are all sorts of open-source crawlers and indexers.
>
> Best Regards
>
>
> -C.B.
>
> On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <en...@gmail.com>
> wrote:
>
>   
>> Cam Bazz wrote:
>>
>>     
>>> hello,
>>>
>>> I have a Lucene index storing documents that hold src and dst words.
>>> Word pairs may repeat (it is a multigraph).
>>>
>>> I want to use Hadoop to count how many identical word pairs there are. I
>>> have looked at the aggregate word count example, and I understand that if I
>>> make a text file
>>> such as
>>>
>>> src1>dst2
>>> src2>dst2
>>> src1>dst2
>>>
>>> ..
>>>
>>> and use something similar to the aggregate word count example, I will get
>>> the desired result.
>>>
>>> Now the questions: how can I hook up my Lucene index to Hadoop? Is there a
>>> better way than dumping the index to a text file with >'s, copying it to
>>> DFS, and getting the results back?
>>>
>>>
>>>       
>> Yes, you can implement an InputFormat that reads from the Lucene index. You
>> can use the implementation in the Nutch project as a starting point, namely
>> the classes DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader.
>>
>>     
>>> How can I make incremental runs? (Once the index is processed and I have
>>> the results, how can I add more data without starting from the beginning?)
>>>
>>>
>>>       
>> As far as I know, there is no easy way to do this. Why do you keep your
>> data as a Lucene index?
>>
>>     
>>> Best regards,
>>>
>>> -C.B.
>>>
>>>
>>>
>>>       
>
>   

Re: edge count question

Posted by Cam Bazz <ca...@gmail.com>.
Hello,

When I use a custom input format, as in the Nutch project, do I have to
keep my index in DFS or on the regular file system?

By the way, are there any alternatives to nutch?


Best Regards


-C.B.

On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <en...@gmail.com>
wrote:

> Cam Bazz wrote:
>
>> hello,
>>
>> I have a Lucene index storing documents that hold src and dst words.
>> Word pairs may repeat (it is a multigraph).
>>
>> I want to use Hadoop to count how many identical word pairs there are. I
>> have looked at the aggregate word count example, and I understand that if I
>> make a text file
>> such as
>>
>> src1>dst2
>> src2>dst2
>> src1>dst2
>>
>> ..
>>
>> and use something similar to the aggregate word count example, I will get
>> the desired result.
>>
>> Now the questions: how can I hook up my Lucene index to Hadoop? Is there a
>> better way than dumping the index to a text file with >'s, copying it to
>> DFS, and getting the results back?
>>
>>
> Yes, you can implement an InputFormat that reads from the Lucene index. You
> can use the implementation in the Nutch project as a starting point, namely
> the classes DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader.
>
>> How can I make incremental runs? (Once the index is processed and I have
>> the results, how can I add more data without starting from the beginning?)
>>
>>
> As far as I know, there is no easy way to do this. Why do you keep your data
> as a Lucene index?
>
>> Best regards,
>>
>> -C.B.
>>
>>
>>
>

Re: HDFS blocks

Posted by Ted Dunning <te...@gmail.com>.
I would strongly recommend leaving the block size large.  Writing the small
files is no big deal since no space is wasted to speak of.

At the data rate that you are talking about, the cost of merging should not
be a big deal.  You should definitely merge often enough to avoid having
very many of these small files.  If you have hundreds of them, you will
definitely notice significant degradation in your ability to process them.

One useful strategy is to merge them repeatedly. This costs you a little bit
in repeated merging, but wins big by keeping the number of files much
smaller.
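
A hedged sketch of one way to run such a periodic merge, using FileUtil.copyMerge to concatenate a directory of small files into a single large file written with the cluster's default block size; the directory and file names below are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // concatenate every small chunk under the (made-up) incoming directory
    // into one big file; the merged file takes the cluster's default block size
    boolean merged = FileUtil.copyMerge(
        fs, new Path("/data/incoming"),           // directory of small chunks
        fs, new Path("/data/merged/chunk-0001"),  // single merged output file
        false,                                    // keep the originals for now
        conf,
        "\n");                                    // newline between chunks
    System.out.println(merged ? "merged" : "nothing to merge");
  }
}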

For the future, lohit's comments are exactly correct ... archive files and
append will make your problems much easier.

For coordinating which files are current and which are partially done, you
might consider using zookeeper.  Very nice for fast, reliable updates.

On Fri, Jun 27, 2008 at 1:18 AM, Goel, Ankur <An...@corp.aol.com>
wrote:

> Hi Folks,
>        I have a setup wherein I am streaming data into HDFS from a
> remote location and creating a new file every X minutes. The files
> generated are very small (512 KB - 6 MB). Because that is the size
> range, the streaming code sets the block size to 6 MB, whereas the
> default we have set for the cluster is 128 MB. The idea behind this is
> to generate small temporal data chunks from multiple sources and merge
> them periodically into a big chunk with our default (128 MB) block size.
>
> The web UI for DFS reports the block size for these files to be 6 MB. My
> questions are:
> 1. Can we have multiple files in DFS use different block sizes?
> 2. If we use the default block size for these small chunks, is DFS space
> wasted? If not, does that mean a single DFS block can hold data from
> more than one file?
>
> Thanks
> -Ankur
>



-- 
ted

HDFS blocks

Posted by "Goel, Ankur" <An...@corp.aol.com>.
Hi Folks,
        I have a setup wherein I am streaming data into HDFS from a
remote location and creating a new file every X minutes. The files
generated are very small (512 KB - 6 MB). Because that is the size
range, the streaming code sets the block size to 6 MB, whereas the
default we have set for the cluster is 128 MB. The idea behind this is
to generate small temporal data chunks from multiple sources and merge
them periodically into a big chunk with our default (128 MB) block size.
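
For illustration only: the block size is a per-file property chosen when the file is created, so a writer can pick a different size for each file, as the streaming code above apparently does. A minimal sketch (the path, buffer size, and replication factor are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallChunkWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // the block size is chosen per file at create time; this file gets a
    // 6 MB block size even though the cluster default is 128 MB
    FSDataOutputStream out = fs.create(
        new Path("/data/chunks/chunk-0001"),  // made-up path
        true,                                  // overwrite if it exists
        4096,                                  // io buffer size
        (short) 3,                             // replication factor
        6 * 1024 * 1024L);                     // 6 MB block size for this file
    out.write("one streamed record\n".getBytes("UTF-8"));
    out.close();
  }
}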

The web UI for DFS reports the block size for these files to be 6 MB. My
questions are:
1. Can we have multiple files in DFS use different block sizes?
2. If we use the default block size for these small chunks, is DFS space
wasted? If not, does that mean a single DFS block can hold data from
more than one file?

Thanks
-Ankur

Re: edge count question

Posted by Enis Soztutar <en...@gmail.com>.
Cam Bazz wrote:
> hello,
>
> I have a Lucene index storing documents that hold src and dst words. Word
> pairs may repeat (it is a multigraph).
>
> I want to use Hadoop to count how many identical word pairs there are. I
> have looked at the aggregate word count example, and I understand that if I
> make a text file
> such as
>
> src1>dst2
> src2>dst2
> src1>dst2
>
> ..
>
> and use something similar to the aggregate word count example, I will get
> the desired result.
>
> Now the questions: how can I hook up my Lucene index to Hadoop? Is there a
> better way than dumping the index to a text file with >'s, copying it to DFS,
> and getting the results back?
>   
Yes, you can implement an InputFormat that reads from the Lucene index. You
can use the implementation in the Nutch project as a starting point, namely
the classes DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader.
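
As a rough, simplified sketch of that idea (not Nutch's actual code): the essential piece is a RecordReader that walks the index documents and hands each one to the mapper as a "src>dst" line. The field names src and dst and the assumption that the index directory is directly openable are illustrative only; an accompanying InputFormat would still be needed to create the splits.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

/** Walks a Lucene index and emits one "src>dst" line per stored document. */
public class LuceneEdgeRecordReader implements RecordReader<LongWritable, Text> {
  private final IndexReader reader;
  private int doc = 0;

  public LuceneEdgeRecordReader(String indexPath) throws IOException {
    // assumes a locally readable index; for an index kept in HDFS, the
    // directory would have to be wrapped the way Nutch does before opening
    this.reader = IndexReader.open(indexPath);
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    while (doc < reader.maxDoc() && reader.isDeleted(doc)) {
      doc++;                                   // skip deleted documents
    }
    if (doc >= reader.maxDoc()) {
      return false;                            // no documents left
    }
    Document d = reader.document(doc);
    key.set(doc);
    value.set(d.get("src") + ">" + d.get("dst"));  // field names are assumed
    doc++;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return doc; }
  public float getProgress() {
    return reader.maxDoc() == 0 ? 1.0f : doc / (float) reader.maxDoc();
  }
  public void close() throws IOException { reader.close(); }
}
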
> How can I make incremental runs? (Once the index is processed and I have the
> results, how can I add more data without starting from the beginning?)
>   
As far as I know, there is no easy way to do this. Why do you keep your
data as a Lucene index?
> Best regards,
>
> -C.B.
>
>