Posted to user@avro.apache.org by "kulkarni.swarnim@gmail.com" <ku...@gmail.com> on 2013/07/01 17:37:15 UTC

Random access in an avro file

Hello,

Is it possible to have random access to a record in an Avro file? For
instance, suppose I have an Avro file whose schema contains four fields:
employee id, name, address and phone. While reading the file, is there any
way to jump directly to the record with employee id 100 instead of having
to scan the whole file every time and filter out records?

Thanks for the help.

-- 
Swarnim

Re: Random access in an avro file

Posted by Doug Cutting <cu...@apache.org>.
On Mon, Jul 1, 2013 at 3:22 PM, kulkarni.swarnim@gmail.com
<ku...@gmail.com> wrote:
> It seems like the file expects the records to be entered in a sorted order
> instead of it doing the sorting internally[1].

Yes, that's right.  The normal use is as the output of a MapReduce
job, where key/value pairs arrive in sorted order.  If you think the
javadoc should make this more explicit, please file a bug report.
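
To make the sorted-order contract concrete, here is a minimal Python sketch of a writer that expects keys in non-decreasing order and leaves the sorting to the caller. It does not use the Avro API: `SortedAppender` and `write_all` are hypothetical names, and raising an error on an out-of-order key is an assumption of this sketch, not a statement about how SortedKeyValueFile behaves.

```python
class SortedAppender:
    """Hypothetical stand-in for a writer that requires sorted input."""

    def __init__(self):
        self.records = []
        self._last_key = None

    def append(self, key, value):
        # Assumption of this sketch: an out-of-order key is rejected
        # rather than silently re-sorted.
        if self._last_key is not None and key < self._last_key:
            raise ValueError(
                f"key {key!r} arrived after {self._last_key!r}; "
                "input must be sorted by key"
            )
        self.records.append((key, value))
        self._last_key = key


def write_all(unsorted_pairs):
    """Sort the (key, value) pairs by key first, then append them in order."""
    appender = SortedAppender()
    for key, value in sorted(unsorted_pairs, key=lambda kv: kv[0]):
        appender.append(key, value)
    return appender.records
```

In a MapReduce job the shuffle already delivers keys in order, so the `sorted(...)` step would only be needed when records come from somewhere else.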

Thanks,

Doug

Re: Random access in an avro file

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
I guess I will answer this question myself. It seems like the file expects
the records to be entered in a sorted order instead of it doing the sorting
internally[1]. I don't think it should hurt us, but honestly it was a little
surprising. It feels like it should be javadoc'ed somewhere that it is
the responsibility of the consumers to sort the records themselves by the
given key before appending to the file. Otherwise, a very useful addition
to the avro library! :)

[1]
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro-mapred/1.7.2/org/apache/avro/hadoop/file/SortedKeyValueFile.java#536


On Mon, Jul 1, 2013 at 1:50 PM, kulkarni.swarnim@gmail.com <
kulkarni.swarnim@gmail.com> wrote:

> Thanks again Doug. SortedKeyValueFile looks really promising and seems to
> fit our use case well.
>
> One last thing I was concerned about was the performance
> of maintaining the sorted order in the file, especially because in our case
> the file might get pretty large (hundreds of thousands to a million). If there
> is a limit on the file size to achieve maximum performance, we can possibly
> think about closing the file and starting to write to another file once we
> start to hit that limit.
>
>
> On Mon, Jul 1, 2013 at 12:51 PM, Doug Cutting <cu...@apache.org> wrote:
>
>> On Mon, Jul 1, 2013 at 10:26 AM, kulkarni.swarnim@gmail.com
>> <ku...@gmail.com> wrote:
>> > Out of curiosity, is maintaining sync markers while writing the file and
>> > then passing these markers to the readers while reading not a good way to
>> > achieve random access in avro?
>>
>> Yes, seeking to the position of a sync marker is possible.  This is
>> what SortedKeyValueFile does.  You need to store the list of positions
>> of sync markers, and if seek is to a column value rather than a row
>> number, then you need to store these values (keys) with the positions.
>>  Those are what's in SortedKeyValueFile's "index" file.
>>
>> Doug
>>
>
>
>
> --
> Swarnim
>



-- 
Swarnim

Re: Random access in an avro file

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Thanks again Doug. SortedKeyValueFile looks really promising and seems to
fit our use case well.

One last thing I was concerned about was the performance of maintaining the
sorted order in the file, especially because in our case the file might get
pretty large (hundreds of thousands to a million). If there is a limit on the
file size to achieve maximum performance, we can possibly think about closing
the file and starting to write to another file once we start to hit that limit.


On Mon, Jul 1, 2013 at 12:51 PM, Doug Cutting <cu...@apache.org> wrote:

> On Mon, Jul 1, 2013 at 10:26 AM, kulkarni.swarnim@gmail.com
> <ku...@gmail.com> wrote:
> > Out of curiosity, is maintaining sync markers while writing the file and
> > then passing these markers to the readers while reading not a good way to
> > achieve random access in avro?
>
> Yes, seeking to the position of a sync marker is possible.  This is
> what SortedKeyValueFile does.  You need to store the list of positions
> of sync markers, and if seek is to a column value rather than a row
> number, then you need to store these values (keys) with the positions.
>  Those are what's in SortedKeyValueFile's "index" file.
>
> Doug
>



-- 
Swarnim

Re: Random access in an avro file

Posted by Doug Cutting <cu...@apache.org>.
On Mon, Jul 1, 2013 at 10:26 AM, kulkarni.swarnim@gmail.com
<ku...@gmail.com> wrote:
> Out of curiosity, is maintaining sync markers while writing the file and
> then passing these markers to the readers while reading not a good way to
> achieve random access in avro?

Yes, seeking to the position of a sync marker is possible.  This is
what SortedKeyValueFile does.  You need to store the list of positions
of sync markers, and if seek is to a column value rather than a row
number, then you need to store these values (keys) with the positions.
 Those are what's in SortedKeyValueFile's "index" file.
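
As a rough illustration of the row-number case described above, the following Python sketch records a byte position at the start of every block while writing, then seeks to the nearest recorded position and skips forward while reading. It is a toy length-prefixed format, not the Avro container format, and `write_with_sync_positions`, `read_row` and `rows_per_block` are hypothetical names chosen for this example.

```python
import io
import struct


def write_with_sync_positions(payloads, rows_per_block=3):
    """Write length-prefixed payloads and remember where each block starts.

    Returns (data_bytes, positions), where positions[i] is the byte offset
    of block i. The positions list plays the role of the stored sync
    positions in this sketch.
    """
    buf = io.BytesIO()
    positions = []
    for row, payload in enumerate(payloads):
        if row % rows_per_block == 0:
            positions.append(buf.tell())
        buf.write(struct.pack(">I", len(payload)))
        buf.write(payload)
    return buf.getvalue(), positions


def read_row(data, positions, row, rows_per_block=3):
    """Seek to the block containing `row`, then skip forward inside it."""
    buf = io.BytesIO(data)
    buf.seek(positions[row // rows_per_block])
    payload = b""
    for _ in range(row % rows_per_block + 1):
        (length,) = struct.unpack(">I", buf.read(4))
        payload = buf.read(length)
    return payload
```

Seeking by a column value instead of a row number would need the key stored next to each position, so that the reader can pick the right block before scanning. This sketch does no bounds checking on `row`.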

Doug

Re: Random access in an avro file

Posted by Scott Carey <sc...@apache.org>.
There are a couple of other index formats that could apply. You can seek to a
sync marker and scan from there. For example, Avro files can be a target
for Elephant Twin
(http://www.slideshare.net/squarecog/flexible-insitu-indexing-for-hadoop-via-elephant-twin ;
http://gitrep.com/users/twitter/repos/elephant-twin).

However, that is a lightweight index for marking which blocks have records
that match the index; it does not locate the exact record.
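
The idea of a block-level index can be sketched in a few lines of Python. This is only an illustration of the general technique, not Elephant Twin's format or API: blocks are modelled as plain lists of dicts, and `build_block_index` and `find` are hypothetical names.

```python
from collections import defaultdict


def build_block_index(blocks, field):
    """Map each value of `field` to the set of block numbers that contain it."""
    index = defaultdict(set)
    for block_no, block in enumerate(blocks):
        for record in block:
            index[record[field]].add(block_no)
    return dict(index)


def find(blocks, index, field, value):
    """Scan only the blocks the index points to and filter inside each one."""
    matches = []
    for block_no in sorted(index.get(value, ())):
        matches.extend(r for r in blocks[block_no] if r[field] == value)
    return matches
```

The index narrows the search to candidate blocks, but each candidate block is still scanned record by record, which is the trade-off described above.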

From:  "kulkarni.swarnim@gmail.com" <ku...@gmail.com>
Reply-To:  "user@avro.apache.org" <us...@avro.apache.org>
Date:  Monday, July 1, 2013 10:26 AM
To:  user <us...@avro.apache.org>
Subject:  Re: Random access in an avro file

Thanks for the reply Doug.

Out of curiosity, is maintaining sync markers while writing the file and
then passing these markers to the readers while reading not a good way to
achieve random access in avro? At least that was my understanding from
reading the javadoc[1], which could be flawed.

[1]
http://avro.apache.org/docs/1.3.3/api/java/org/apache/avro/file/DataFileWriter.html#sync()


On Mon, Jul 1, 2013 at 12:05 PM, Doug Cutting <cu...@apache.org> wrote:
> Avro data files do not generally support random access.
> 
> SortedKeyValueFile supports random access by key.
> 
> http://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.Reader.html
> 
> From the documentation:
> 
> "The SortedKeyValueFile is a directory with two files, named 'data'
> and 'index'. The 'data' file is an ordinary Avro container file with
> records. Each record has exactly two fields, 'key' and 'value'. The
> keys are sorted lexicographically. The 'index' file is a small Avro
> container file mapping keys in the 'data' file to their byte
> positions. The index file is intended to fit in memory, so it should
> remain small. There is one entry in the index file for each data block
> in the Avro container file."
> 
> Doug
> 
> On Mon, Jul 1, 2013 at 8:37 AM, kulkarni.swarnim@gmail.com
> <ku...@gmail.com> wrote:
>> > Hello,
>> >
>> > Is it possible to have random access to a record in an avro file? For
>> > instance, if I have an avro file with a schema containing four fields:
>> > employee id, name, address and phone. While reading the file, is there any
>> > way at all to directly jump to a record with employee id 100 instead of
>> > having to scan the whole file every single time and filtering out records?
>> >
>> > Thanks for the help.
>> >
>> > --
>> > Swarnim



-- 
Swarnim 



Re: Random access in an avro file

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Thanks for the reply Doug.

Out of curiosity, is maintaining sync markers while writing the file and
then passing these markers to the readers while reading not a good way to
achieve random access in avro? At least that was my understanding from
reading the javadoc[1], which could be flawed.

[1]
http://avro.apache.org/docs/1.3.3/api/java/org/apache/avro/file/DataFileWriter.html#sync()


On Mon, Jul 1, 2013 at 12:05 PM, Doug Cutting <cu...@apache.org> wrote:

> Avro data files do not generally support random access.
>
> SortedKeyValueFile supports random access by key.
>
>
> http://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.Reader.html
>
> From the documentation:
>
> "The SortedKeyValueFile is a directory with two files, named 'data'
> and 'index'. The 'data' file is an ordinary Avro container file with
> records. Each record has exactly two fields, 'key' and 'value'. The
> keys are sorted lexicographically. The 'index' file is a small Avro
> container file mapping keys in the 'data' file to their byte
> positions. The index file is intended to fit in memory, so it should
> remain small. There is one entry in the index file for each data block
> in the Avro container file."
>
> Doug
>
> On Mon, Jul 1, 2013 at 8:37 AM, kulkarni.swarnim@gmail.com
> <ku...@gmail.com> wrote:
> > Hello,
> >
> > Is it possible to have random access to a record in an avro file? For
> > instance, if I have an avro file with a schema containing four fields:
> > employee id, name, address and phone. While reading the file, is there any
> > way at all to directly jump to a record with employee id 100 instead of
> > having to scan the whole file every single time and filtering out records?
> >
> > Thanks for the help.
> >
> > --
> > Swarnim
>



-- 
Swarnim

Re: Random access in an avro file

Posted by Doug Cutting <cu...@apache.org>.
Avro data files do not generally support random access.

SortedKeyValueFile supports random access by key.

http://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.Reader.html

From the documentation:

"The SortedKeyValueFile is a directory with two files, named 'data'
and 'index'. The 'data' file is an ordinary Avro container file with
records. Each record has exactly two fields, 'key' and 'value'. The
keys are sorted lexicographically. The 'index' file is a small Avro
container file mapping keys in the 'data' file to their byte
positions. The index file is intended to fit in memory, so it should
remain small. There is one entry in the index file for each data block
in the Avro container file."
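
The layout described in that passage can be mimicked with a small Python sketch: a "data" buffer of key-sorted records and an in-memory index holding one (first key, byte position) entry per block. This is a conceptual model only, assuming JSON lines instead of Avro encoding and integer keys instead of lexicographically sorted ones; `write_sorted`, `lookup` and `block_size` are hypothetical names and not part of the Avro API.

```python
import bisect
import io
import json


def write_sorted(records, block_size=2):
    """Write key-sorted (key, value) pairs; index the first key of each block.

    Returns (data_bytes, index), where index is a list of
    (first_key_in_block, byte_position) tuples, one per block.
    """
    data = io.BytesIO()
    index = []
    for i, (key, value) in enumerate(records):
        if i % block_size == 0:
            index.append((key, data.tell()))
        data.write((json.dumps([key, value]) + "\n").encode("utf-8"))
    return data.getvalue(), index


def lookup(data_bytes, index, key):
    """Binary-search the index, seek to that block, then scan forward."""
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before the first record
    data = io.BytesIO(data_bytes)
    data.seek(index[i][1])
    for line in data:
        k, v = json.loads(line)
        if k == key:
            return v
        if k > key:
            break  # passed the spot where the key would be
    return None
```

A lookup such as `lookup(data, index, 300)` touches only the index and the records from one block start onward, which is why the index needs just one entry per block and can stay small.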

Doug

On Mon, Jul 1, 2013 at 8:37 AM, kulkarni.swarnim@gmail.com
<ku...@gmail.com> wrote:
> Hello,
>
> Is it possible to have random access to a record in an avro file? For
> instance, if I have an avro file with a schema containing four fields:
> employee id, name, address and phone. While reading the file, is there any
> way at all to directly jump to a record with employee id 100 instead of
> having to scan the whole file every single time and filtering out records?
>
> Thanks for the help.
>
> --
> Swarnim