Posted to dev@crunch.apache.org by Ben Juhn <be...@gmail.com> on 2016/02/26 02:15:06 UTC

Processing splittable inputs

Hello there,

I haven’t been able to get Crunch to split inputs into multiple mappers.  Currently it’s giving me one mapper per text file, even though they’re 1GB each.  I’ve tried supplying split.maxsize on the command line and in the DoFn implementation:

@Override
public void configure(Configuration conf) {
    conf.set("crunch.combine.file.size", "67108864");                      // 64 MB
    conf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864"); // 64 MB
    conf.set("mapreduce.input.fileinputformat.split.minsize", "67108864");
}

Any suggestions?

Thanks,
Ben


Re: Processing splittable inputs

Posted by Josh Wills <jo...@gmail.com>.
Yeah, I suspect the Source-property approach is the right thing here.

On Fri, Feb 26, 2016 at 3:37 PM, Micah Whitacre <mk...@gmail.com> wrote:

> Where are you trying to specify them?  Inside a DoFn?  Prior to
> constructing the MRPipeline?
>
> I'd suggest trying either:
> 1. Setting those values on the initial Configuration object you pass to the
> MRPipeline
> 2. Setting them as Source specific properties[1] on the source itself.
>
> The latter approach might be better if you are reading a lot of different
> sources into your pipeline and don't want to affect them all.
>
> [1] -
>
> http://crunch.apache.org/apidocs/0.12.0/org/apache/crunch/Source.html#inputConf(java.lang.String,%20java.lang.String)
>

Re: Processing splittable inputs

Posted by Micah Whitacre <mk...@gmail.com>.
Where are you trying to specify them?  Inside a DoFn?  Prior to
constructing the MRPipeline?

I'd suggest trying either:
1. Setting those values on the initial Configuration object you pass to the
MRPipeline
2. Setting them as Source-specific properties[1] on the source itself.

The latter approach might be better if you are reading a lot of different
sources into your pipeline and don't want to affect them all.

[1] -
http://crunch.apache.org/apidocs/0.12.0/org/apache/crunch/Source.html#inputConf(java.lang.String,%20java.lang.String)
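The two suggestions above might look roughly like the sketch below (untested; the input path is a placeholder, and `From.textFile` / `Source.inputConf` are used as described in the linked Javadoc):

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Source;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.hadoop.conf.Configuration;

public class SplitSizeSketch {
  public static void main(String[] args) {
    // Option 1: set the split sizes on the Configuration handed to the
    // MRPipeline constructor, affecting every source in the pipeline.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864"); // 64 MB
    conf.set("mapreduce.input.fileinputformat.split.minsize", "67108864");
    Pipeline pipeline = new MRPipeline(SplitSizeSketch.class, conf);

    // Option 2: scope the settings to one source with inputConf, leaving
    // other sources in the pipeline untouched ("/data/big-input" is a
    // made-up path for illustration).
    Source<String> source = From.textFile("/data/big-input")
        .inputConf("mapreduce.input.fileinputformat.split.maxsize", "67108864")
        .inputConf("mapreduce.input.fileinputformat.split.minsize", "67108864");

    PCollection<String> lines = pipeline.read(source);
    // ... DoFns over `lines` ...
    pipeline.done();
  }
}
```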

On Fri, Feb 26, 2016 at 5:17 PM, Ben Juhn <be...@gmail.com> wrote:

> The data isn’t compressed.  The parameters aren’t showing up in the job
> configuration either.
>
>

Re: Processing splittable inputs

Posted by Ben Juhn <be...@gmail.com>.
The data isn’t compressed.  The parameters aren’t showing up in the job configuration either.



Re: Processing splittable inputs

Posted by Micah Whitacre <mk...@gmail.com>.
Ben,
  Are the text files you are processing compressed?  If so, that data
wouldn't be splittable. [1]

[1] -
http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.6.0/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java#57
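For reference, the check in the linked TextInputFormat boils down to: a file is splittable when it has no compression codec, or when its codec supports splitting (bzip2 in stock Hadoop; gzip streams must be read whole). A toy stand-alone illustration of that decision (not the actual Hadoop class; the codec categories are simplified):

```java
public class SplittabilityCheck {
  // Simplified codec categories for illustration only.
  enum Codec { NONE, GZIP, BZIP2 }

  static boolean isSplittable(Codec codec) {
    // No codec means plain text, which is always splittable.
    if (codec == Codec.NONE) return true;
    // Otherwise only codecs implementing SplittableCompressionCodec
    // (bzip2 in stock Hadoop) allow the input to be split.
    return codec == Codec.BZIP2;
  }

  public static void main(String[] args) {
    System.out.println(isSplittable(Codec.NONE));
    System.out.println(isSplittable(Codec.GZIP));
    System.out.println(isSplittable(Codec.BZIP2));
  }
}
```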
