Posted to user@nutch.apache.org by "Caklovic, Nenad" <Ne...@amd.com> on 2012/05/23 01:45:35 UTC
Common Crawl dataset
Hi.
I am trying to import some Common Crawl dataset files into Nutch.
Those files are in Arc file format.
I tried using the ArcSegmentCreator tool, but that didn't work well.
It used up all the heap space, and increasing the heap space limit didn't help.
Does anyone have any thoughts on this?
Is there a better way to import Common Crawl files?
Why does ArcSegmentCreator have issues?
Thanks,
Nenad
Re: Common Crawl dataset
Posted by Julien Nioche <li...@gmail.com>.
> I am trying to import some Common Crawl dataset files into Nutch.
> Those files are in Arc file format.
> I tried using ArcSegmentCreator tool, but that didn't work well.
>
I think Common Crawl uses a slightly different definition of ARCs, but I am
not sure. Anyway, they have released a library to read and write their
format, https://github.com/commoncrawl/commoncrawl, which I have tried to use
with Behemoth (https://github.com/jnioche/behemoth-commoncrawl), but without
much luck so far.
> It was using up all the heap space. Increasing heap space limit didn't
> help.
>
> Does anyone have any thoughts on this?
> Is there a better way to import Common Crawl files?
>
See above. It depends on what you need to do. Behemoth has a Tika module
ready, so if you want to parse the dataset, this would be a good option.
> Why does ArcSegmentCreator have issues?
>
Strange question; it sounds as if bugs were written on purpose. Anyway,
there might be a bug, but again, I am not sure the ARCs generated by CC are
in the same format.
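
One way to check whether a given file follows the standard layout is to read it one record at a time and validate each header. The following is only a minimal sketch, assuming the classic Internet Archive ARC v1 record header (five space-separated fields: URL, IP address, archive date, content type, length) followed by exactly that many bytes of content. The names ArcHeader, parse_header and iter_records are hypothetical; this is not the ArcSegmentCreator code or the Common Crawl library, and Common Crawl files may well deviate from this layout.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple, Tuple


class ArcHeader(NamedTuple):
    """Fields of an ARC v1 URL-record header line (assumed layout)."""
    url: str
    ip: str
    date: str
    content_type: str
    length: int


def parse_header(line: bytes) -> ArcHeader:
    """Parse one header line; raise ValueError if it is not 5 fields."""
    parts = line.decode("utf-8", errors="replace").split(" ")
    if len(parts) != 5:
        raise ValueError("not an ARC v1 header: %r" % line[:80])
    url, ip, date, content_type, length = parts
    return ArcHeader(url, ip, date, content_type, int(length))


def iter_records(stream: BinaryIO) -> Iterator[Tuple[ArcHeader, bytes]]:
    """Yield (header, body) pairs one at a time.

    Only a single record body is held in memory at any point, so memory
    use is bounded by the largest record rather than by the file size.
    """
    while True:
        line = stream.readline()
        if not line:          # end of stream
            return
        line = line.rstrip(b"\r\n")
        if not line:          # blank separator line between records
            continue
        header = parse_header(line)
        body = stream.read(header.length)
        if len(body) != header.length:
            raise ValueError("truncated record for %s" % header.url)
        yield header, body


# Tiny in-memory sample with two made-up records, for illustration only.
sample = (
    b"\nhttp://example.com/ 1.2.3.4 20120523014535 text/html 5\nhello\n"
    b"\nhttp://example.org/ 5.6.7.8 20120523014536 text/plain 3\nabc\n"
)
records = list(iter_records(io.BytesIO(sample)))
```

For a gzip-compressed file, the stream could be opened with gzip.open(path, "rb") from the Python standard library. A ValueError on the first few records would suggest that the file does not use this header layout.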
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Common Crawl dataset
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Nenad,
It would be really great if you could post some output for us to see.
Personally I've not used this tool for a while and it would be a good
excuse to get to grips with it.
--
Lewis