Posted to user@nutch.apache.org by "Caklovic, Nenad" <Ne...@amd.com> on 2012/05/23 01:45:35 UTC

Common Crawl dataset

Hi.

I am trying to import some Common Crawl dataset files into Nutch.
Those files are in Arc file format.
I tried using ArcSegmentCreator tool, but that didn't work well.
It was using up all the heap space. Increasing heap space limit didn't help.

Does anyone have any thoughts on this?
Is there a better way to import Common Crawl files?
Why does ArcSegmentCreator have issues?

Thanks,
Nenad


Re: Common Crawl dataset

Posted by Julien Nioche <li...@gmail.com>.
> I am trying to import some Common Crawl dataset files into Nutch.
> Those files are in Arc file format.
> I tried using ArcSegmentCreator tool, but that didn't work well.
>

I think Common Crawl uses a slightly different definition of ARCs, though
I am not sure. Anyway, they have released a library to read and write their
format, https://github.com/commoncrawl/commoncrawl, which I have tried to
use with Behemoth (https://github.com/jnioche/behemoth-commoncrawl), but
without much luck so far.


> It was using up all the heap space. Increasing heap space limit didn't
> help.
>


> Does anyone have any thoughts on this?
> Is there a better way to import Common Crawl files?
>

See above. It depends on what you need to do. Behemoth has a Tika module
ready, so if you want to parse the dataset, this would be a good option.


> Why does ArcSegmentCreator have issues?
>

Strange question. It sounds as if bugs were written for a purpose. Anyway,
there might be a bug, but again, I am not sure the ARCs generated by CC are
in the same format.
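
One way to check whether a given file follows the classic ARC layout is to
stream through its records and look at the headers. The following is only a
minimal sketch in Python, not anything taken from Nutch or the Common Crawl
library. It assumes the standard Internet Archive ARC v1 layout: a
space-separated header line whose last field is the record length in bytes,
followed by that many bytes of content and a blank separator line. Common
Crawl files may not match this, which is what such a check would reveal. The
function name iter_arc_records and the sample data are hypothetical.

```python
import gzip
import io


def iter_arc_records(stream):
    """Yield (header_fields, body) per record from an uncompressed ARC v1 stream.

    Assumes each record starts with a space-separated header line whose
    last field is the body length in bytes (hypothetical helper).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator line between records
        fields = line.decode("latin-1").split()
        length = int(fields[-1])  # last header field: body length in bytes
        body = stream.read(length)  # only one record is held in memory at a time
        yield fields, body


# Small in-memory sample standing in for a real .arc.gz file (made-up data).
sample = gzip.compress(
    b"filedesc://sample.arc 0.0.0.0 20120523000000 text/plain 5\nhello\n"
    b"http://example.com/ 1.2.3.4 20120523000000 text/html 3\nabc\n"
)

with gzip.open(io.BytesIO(sample), "rb") as f:
    for fields, body in iter_arc_records(f):
        print(fields[0], len(body))
```

If the header line of a real file does not split into fields ending in an
integer length, the int() call fails on the first record, which is a quick
sign that the file is not in the layout assumed here.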

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Common Crawl dataset

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Nenad,

It would be really great if you could post some output for us to see.
Personally I've not used this tool for a while and it would be a good
excuse to get to grips with it.




-- 
Lewis