Posted to dev@nutch.apache.org by Ami Akshay Parikh <am...@usc.edu> on 2015/02/26 19:53:52 UTC

MetaData for near duplicates

Hello,

When I try to use the parse_data from the segment directory to get the
metadata for finding near duplicates, my code runs into an EOFException.
I found something about a bug in Nutch in the archives, but I wanted to
know if anyone else is facing this problem and how I can resolve it.

Thanks,

Regards,
Ami Parikh
(213)590-0005

Re: MetaData for near duplicates

Posted by Ami Akshay Parikh <am...@usc.edu>.
Yes, I know about that. I just thought that since parse_data already
does that for us, I did not want to repeat the same processing. I will
try to figure something out. Thanks a lot.

Regards,
Ami Parikh
(213)590-0005
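For the near-duplicate step itself, once the metadata strings have been extracted from parse_data, a simple baseline is pairwise Jaccard similarity over token sets. The sketch below is plain Java with no Nutch dependency; the tokenization and the 0.6 threshold are illustrative assumptions, not anything Nutch prescribes:

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardNearDup {

    // Tokenize a metadata/text string into a lowercase word set.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    // Two records are near duplicates if similarity meets the threshold.
    static boolean nearDuplicate(String x, String y, double threshold) {
        return jaccard(tokens(x), tokens(y)) >= threshold;
    }
}
```

Pairwise comparison is quadratic in the number of pages, so for a 50K-URL crawl a hashing scheme such as simhash would scale better; the Jaccard version is just the easiest thing to verify first.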

On Thu, Feb 26, 2015 at 12:39 PM, Renxia Wang <re...@usc.edu> wrote:

> Not sure how you implement it so it is hard to tell. You may want to take
> a look at the SegmentReader's get and getMapRecords methods. Those may give
> you ideas. You can use SegmentReader.get directly to get the segment data
> too. While it is slow as it slepp(5000) at every time you call it, so slow
> that you definitely cannot get the result tomorrow by running it on your
> 50K urls data set. Muti-threading to call the SegmentReader.get on all the
> segments at the same time can speed this up, while if you have a lot of
> segments(like me,  > 20), OutOfMemory issue will come to you, even if you
> set the java heap size to be 4GBs(or even more) I am stuck at here. T_T
>
> Zhique
>
>
>
> On Thu, Feb 26, 2015 at 11:54 AM, Ami Akshay Parikh <am...@usc.edu>
> wrote:
>
>> I am using the MapFileReader to iterate through the file. And I read the
>> key into a Text object and the MetaData into a ParseData object. I get the
>> following exception:
>>
>> Exception in thread "main" java.io.EOFException
>> at java.io.DataInputStream.readFully(DataInputStream.java:197)
>> at org.apache.hadoop.io.Text.readString(Text.java:402)
>> at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
>> at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
>> at
>> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
>> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
>> at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
>> at NearDuplicates.main(NearDuplicates.java:58)
>>
>> Thanks,
>>
>> Regards,
>> Ami Parikh
>> (213)590-0005
>>
>> On Thu, Feb 26, 2015 at 11:00 AM, Renxia Wang <re...@usc.edu> wrote:
>>
>>> Hi Ami,
>>>
>>> What method of what class do you use to get the meta data? Please
>>> provide more info about this, log etc.
>>>
>>> Zhique
>>>
>>> On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh <am...@usc.edu>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> When I try to use the parse_data from the segment directory for getting
>>>> the MetaData for finding near duplicates, My code runs into a EOFException.
>>>> I found something about a bug in nutch in the archives, but I wanted to
>>>> know if anyone else is facing this problem and how can I possibly resolve
>>>> it.
>>>>
>>>> Thanks,
>>>>
>>>> Regards,
>>>> Ami Parikh
>>>> (213)590-0005
>>>>
>>>
>>>
>>
>

Re: MetaData for near duplicates

Posted by Renxia Wang <re...@usc.edu>.
Not sure how you implemented it, so it is hard to tell. You may want to
take a look at SegmentReader's get and getMapRecords methods; those may
give you ideas. You can also use SegmentReader.get directly to get the
segment data, but it is slow: it sleep(5000)s on every call, so slow that
you definitely cannot get the result by tomorrow running it on your
50K-URL data set. Multi-threading the SegmentReader.get calls over all
the segments at the same time can speed this up, but if you have a lot of
segments (like me, > 20), you will run into OutOfMemory errors even with
the Java heap size set to 4 GB (or more). I am stuck here. T_T

Zhique
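The multi-threading idea above can be sketched with a bounded thread pool; a fixed-size pool also caps how many segments are being read at once, which is one way to keep the memory problem in check. In this sketch, readSegment is a hypothetical stand-in for whatever per-segment call you make (e.g. a SegmentReader.get wrapper); only the pooling pattern is the point:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SegmentPool {

    // Hypothetical per-segment work; replace with your SegmentReader call.
    static String readSegment(String segment) {
        return "read:" + segment;
    }

    // Process all segments with at most `threads` running concurrently,
    // so memory stays bounded even when there are many segments.
    static List<String> readAll(List<String> segments, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String s : segments) {
                futures.add(pool.submit(() -> readSegment(s)));
            }
            // Collect results in submit order.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With more than 20 segments, a pool of 2-4 threads trades some speed for a predictable memory ceiling, instead of one thread (and one open reader) per segment.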




Re: MetaData for near duplicates

Posted by Ami Akshay Parikh <am...@usc.edu>.
I am using MapFile.Reader to iterate through the file. I read the key
into a Text object and the metadata into a ParseData object, and I get
the following exception:

Exception in thread "main" java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at org.apache.hadoop.io.Text.readString(Text.java:402)
at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
at NearDuplicates.main(NearDuplicates.java:58)

Thanks,

Regards,
Ami Parikh
(213)590-0005
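An EOFException inside Metadata.readFields like this often means the bytes in the file do not match the class deserializing them, e.g. the segment was written by a different Nutch version than the jar on the classpath, or a part file is truncated. One way to rule out a class mismatch is to let the reader instantiate the key/value types recorded in the file itself instead of hard-coding Text and ParseData. A rough sketch of that pattern (the segment path is an assumption; adapt to your crawl layout and Hadoop version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

public class ParseDataDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical layout: <segment>/parse_data/part-00000 holds the MapFile.
        Path parseData = new Path(args[0], "parse_data/part-00000");

        MapFile.Reader reader =
                new MapFile.Reader(fs, parseData.toString(), conf);
        // Instantiate key/value from the classes the file was written with,
        // rather than hard-coding Text/ParseData; a mismatch there is one
        // common source of EOFException in readFields().
        WritableComparable<?> key = (WritableComparable<?>)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);

        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}
```

If getValueClass() reports org.apache.nutch.parse.ParseData and the exception still occurs, the likelier culprits are a version mismatch between the Nutch that wrote the segment and the one on your classpath, or a corrupt/partial segment from an interrupted parse.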


Re: MetaData for near duplicates

Posted by Renxia Wang <re...@usc.edu>.
Hi Ami,

Which method of which class are you using to get the metadata? Please
provide more info about this, logs, etc.

Zhique
