Posted to user@accumulo.apache.org by Mike Hugo <mi...@piragua.com> on 2013/05/14 00:09:38 UTC

Should I store Long values as String or Long?

I've been playing around with the LongCombiner on a table that's summing up
the counts of output of a MapReduce job, very similar to the WordCount
example from the user manual.

I started out encoding the values using LongCombiner.FIXED_LEN_ENCODER, but
have noticed that this can lead to some confusion later on downstream.  For
example, a co-worker was scanning using the shell and was caught off guard
by the encoded values.  Also, out of the box, the StatsCombiner example
works using String values, not Long values, so we built a custom piece to
essentially do the same thing with Long values instead.

It looks to me like most of the examples I've seen just store things as
String values, rather than encoding them.  What are the tradeoffs?  We're
at a point where we could pretty easily switch things to just use strings -
it seems like that might make things more convenient from a maintenance
perspective (human readable values) and would allow us to re-use some
existing components (e.g. StatsCombiner).  Any thoughts?

Thanks,

Mike

Re: Should I store Long values as String or Long?

Posted by Jared Winick <ja...@gmail.com>.
I believe the feature John is referring to above is the Formatter interface
(org.apache.accumulo.core.util.format.Formatter). You can implement this
interface to convert key/values to a more human readable format for the
shell. You can drop a JAR file containing your implementation into lib/ext
just like your Iterators and then load it in the shell with the "formatter"
command.
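
The conversion such a formatter would perform can be sketched in plain Java. This is only a rough illustration: it does not implement the Formatter interface itself, the class and method names (LongValueDisplay, render) are hypothetical, and it assumes the stored value is a long written as 8 big-endian bytes.

```java
import java.nio.ByteBuffer;

// Plain-Java sketch; it does not use or implement any Accumulo classes.
final class LongValueDisplay {

    // Hypothetical helper: turns an 8-byte big-endian value into decimal text,
    // which is the kind of conversion a shell formatter could apply to each value.
    static String render(byte[] raw) {
        if (raw.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + raw.length);
        }
        return Long.toString(ByteBuffer.wrap(raw).getLong());
    }

    public static void main(String[] args) {
        byte[] stored = ByteBuffer.allocate(Long.BYTES).putLong(42L).array();
        System.out.println(render(stored));
    }
}
```

A real implementation would call something like render() on each value from inside the Formatter methods, so the data on disk stays binary while shell users see readable numbers.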


On Mon, May 13, 2013 at 8:04 PM, Mike Hugo <mi...@piragua.com> wrote:

> Thanks - String it is!

Re: Should I store Long values as String or Long?

Posted by Mike Hugo <mi...@piragua.com>.
Thanks - String it is!


On Mon, May 13, 2013 at 7:47 PM, Christopher <ct...@apache.org> wrote:

> Well, encoding it might save space, but strings are nice and
> human-readable, especially in the shell, and in the overall scheme of
> things, a string probably isn't really that much larger on disk,
> especially after compression.

Re: Should I store Long values as String or Long?

Posted by Christopher <ct...@apache.org>.
Well, encoding it might save space, but strings are nice and
human-readable, especially in the shell, and in the overall scheme of
things, a string probably isn't really that much larger on disk,
especially after compression.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
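
To put rough numbers on the size tradeoff, here is a minimal sketch in plain Java. It does not use Accumulo classes; it assumes a fixed-length encoding stores every long as 8 big-endian bytes and that the String form is the decimal digits as UTF-8. The class and method names are made up for illustration, and the sizes are before any compression.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Plain-Java illustration only; it does not use Accumulo classes.
final class ValueEncodingSizes {

    // Assumption: a fixed-length encoding stores the long as 8 big-endian bytes.
    static byte[] fixedLength(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // String encoding: the decimal digits of the number as UTF-8 text.
    static byte[] asString(long v) {
        return Long.toString(v).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        long[] samples = {7L, 12345L, 9_999_999_999L, Long.MAX_VALUE};
        for (long v : samples) {
            // Print the encoded size of each sample under both schemes.
            System.out.printf("%d: fixed=%d bytes, string=%d bytes%n",
                    v, fixedLength(v).length, asString(v).length);
        }
    }
}
```

Under these assumptions a count below 100,000,000 takes no more bytes as a String than as a fixed 8-byte long, and the worst case for a non-negative long is 19 bytes of digits versus 8.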



Re: Should I store Long values as String or Long?

Posted by John Vines <vi...@apache.org>.
If it's just the value, it's really up to your preference. Since it
sounds like shell users have trouble with encoded data in the value,
you can switch to String representations. A possible alternative is
using the views we have in the shell (transformations? I don't remember the
name, I don't know much about them).

Another concern is that if you have iterators/combiners running on the
values, they need to be aware of the format. But ultimately, the point is
that the format itself really doesn't matter, as long as you stay
consistent from then on.
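
The consistency point can be illustrated with a small plain-Java sketch. It is not an Accumulo combiner: sumStringValues is a hypothetical stand-in for a summing step that expects decimal-string values, and the 8-byte big-endian value below is an assumed example of data written in a different format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; it does not use Accumulo classes.
final class StringSumSketch {

    // Hypothetical stand-in for a summing combiner over string-encoded values.
    static long sumStringValues(List<byte[]> values) {
        long sum = 0;
        for (byte[] raw : values) {
            // Throws NumberFormatException if a value is not decimal text.
            sum += Long.parseLong(new String(raw, StandardCharsets.UTF_8));
        }
        return sum;
    }

    // Reports whether every value in the list can be parsed as a decimal string.
    static boolean canSum(List<byte[]> values) {
        try {
            sumStringValues(values);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] three = "3".getBytes(StandardCharsets.UTF_8);
        byte[] four = "4".getBytes(StandardCharsets.UTF_8);
        // A value written in a different (binary) format than the reader expects.
        byte[] binaryFive = ByteBuffer.allocate(Long.BYTES).putLong(5L).array();

        System.out.println(sumStringValues(Arrays.asList(three, four)));
        System.out.println(canSum(Arrays.asList(three, binaryFive)));
    }
}
```

The first call sums cleanly because both values use the expected format; the second reports false because one value was written as raw bytes, which is the kind of mismatch a format switch can introduce if old and new data end up mixed in one table.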

