Posted to dev@vxquery.apache.org by Eldon Carman <ec...@ucr.edu> on 2014/01/17 00:41:14 UTC

VXQuery File Size

Question:
What is our target file size? VXQuery has been designed to work on many
small files, but what is a small file? Are we talking 64 MB or 64 KB?

Background:
The issue came to my attention when I ran out of inodes on one of the nodes
while replicating the weather data set. Apparently one node in our cluster
has a 2 TB drive and is limited to 132,816,896 inodes. My naive partitioning
method for benchmarking replicated the weather data five times, which
exceeds the number of inodes available.
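For a quick back-of-the-envelope check on the inode budget (the limit below is the figure from the constrained node; in practice `df -i` reports the live per-filesystem numbers):

```python
# Rough inode budget for the benchmark setup described above.
# The inode limit comes from this thread; the per-copy file budget it
# implies is derived, not measured.
inode_limit = 132_816_896  # 2 TB drive on the constrained node
replicas = 5               # naive partitioning replicates the data five times

max_files_per_copy = inode_limit // replicas
print(max_files_per_copy)  # 26563379
```

So anything much past ~26.5 million files per copy blows the budget on that node.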

In researching the issue, we ran the following command to count the number
of files:
   time find . -type f | wc -l
Here are the results:
  ** I am still waiting after about 4 hours; will update when it's finished **

It seems we take a huge performance hit with my current configuration of the
weather data. The average file size is probably 32 KB. The XML documents come
from querying a web service provided by NOAA; each file holds a month's
records of sensor data.

The concern is how the query time is affected by the act of opening and
closing so many files.
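To put a rough number on that concern, here is a purely illustrative micro-benchmark of the per-file open/read/close cost; the file names, count, and contents are all made up:

```python
import os
import tempfile
import time

def read_many(paths):
    """Open, read, and close every file individually, timing the whole pass."""
    start = time.perf_counter()
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    return time.perf_counter() - start, total

# Build a small throwaway tree of XML-ish files (contents are placeholders).
tmp = tempfile.mkdtemp()
paths = []
for i in range(200):
    p = os.path.join(tmp, f"month{i:03d}.xml")
    with open(p, "wb") as f:
        f.write(b"<records/>")
    paths.append(p)

elapsed, total = read_many(paths)
print(f"read {len(paths)} files ({total} bytes) in {elapsed:.4f}s")
```

Scaling the per-file overhead up to tens of millions of files is where the worry comes from.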

Options:
1. Treat this as the parameters we defined for our test.
2. Change the amount of data returned by each web service query. Example:
query for a year's worth of data, thus reducing the number of files by a
factor of 12.
3. Create a way to store multiple XML documents appended together in a single
file, thus reducing the number of times a file must be opened and closed.
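As a sketch of what Option 3 could look like, here is a minimal reader for several documents appended into one file. The one-document-per-line framing is only an assumption for illustration, not a proposed VXQuery format:

```python
import io
import xml.etree.ElementTree as ET

# Several small XML documents appended back-to-back in one "file"
# (a StringIO stands in for a real file on disk; names are made up).
blob = io.StringIO(
    "<observation station='A' month='1'/>\n"
    "<observation station='A' month='2'/>\n"
    "<observation station='B' month='1'/>\n"
)

def iter_documents(stream):
    """Yield one parsed document per non-empty line (assumed framing)."""
    for line in stream:
        line = line.strip()
        if line:
            yield ET.fromstring(line)

docs = list(iter_documents(blob))
print(len(docs))  # 3
```

One open/close then yields every document, which is the whole point of the option.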

Re: VXQuery File Size

Posted by Eldon Carman <ec...@ucr.edu>.
FYI on weather data.

The Weather Web Service offers weather data through queries to its website.
In researching the possible data queries, I have found the way to get the
largest amount of real data in a single query. The site limits each data
query to a single month. In addition, results are paged when a station has
more than four sensors to report. The resulting XML document is at most
32 KB. While sizes vary with the number and size of the data points for each
sensor, we have an upper bound of around 32 KB.

The file size averages for two sections of the larger dataset (a third is
still being set up):
Size        # of Files    Average File Size
~50 MB       7,476         ~7 KB
~500 MB     30,982        ~17 KB
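The averages in the table can be sanity-checked directly from the totals (assuming the sizes are exact and 1 MB = 1024 KB):

```python
# Recompute the per-file averages from the totals in the table above.
for total_mb, nfiles in [(50, 7_476), (500, 30_982)]:
    avg_kb = total_mb * 1024 / nfiles
    print(f"~{total_mb} MB / {nfiles} files -> ~{avg_kb:.0f} KB per file")
```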

I am working on getting a larger test set of ~8.5 GB set up. The average
file size differs based on the sensors chosen. Each of the larger datasets
includes the smaller versions.

I am working on a new way of partitioning the data with symbolic links to
get around my inode issue. For now I am sticking with the real data and
working around the inode limit.
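A sketch of the symlink idea: one physical copy of the data, with each benchmark "replica" directory holding symlinks back to it. All directory and file names here are hypothetical, and note that each symlink itself still consumes an inode on the filesystem it lives on, so the layout needs checking against the actual constraint:

```python
import os
import tempfile

# One physical copy of the data; each benchmark "replica" directory holds
# symlinks to it instead of duplicated files. Names are made up.
real = tempfile.mkdtemp(prefix="weather_real_")
with open(os.path.join(real, "station1.xml"), "w") as f:
    f.write("<data/>")

replicas = []
for r in range(5):
    replica = tempfile.mkdtemp(prefix=f"weather_replica{r}_")
    for name in os.listdir(real):
        os.symlink(os.path.join(real, name), os.path.join(replica, name))
    replicas.append(replica)

print(all(os.path.islink(os.path.join(d, "station1.xml")) for d in replicas))
```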


On Thu, Jan 16, 2014 at 9:07 PM, Eldon Carman <ec...@ucr.edu> wrote:

> On Thu, Jan 16, 2014 at 8:41 PM, Vinayak Borkar <vi...@gmail.com> wrote:
>
>> On 1/16/14, 3:41 PM, Eldon Carman wrote:
>>
>>> Question:
>>> What is our target file size? VXQuery has been designed to work on many
>>> small files, but what is a small file? Are we talking 64 MB or 64 KB?
>>>
>>
>> The restriction is on the size of objects (or documents). In VXQuery, each
>> document has to fit in a frame under the current implementation, and since
>> one XML file contains one XML document, this translates into a limit on file
>> sizes. I think we should do Option 3 from your mail below and support files
>> that have multiple documents concatenated and stored in the same file. (This
>> should be fine since the collection function returns a collection of items.)
>>
>>
> Ok, let's discuss this more. Some of the new rewrite rules for pushing the child
> steps into the data source scan may also help with processing larger files,
> while keeping our frame size relatively small.
>
>
>>
>>
>>> Background:
>>> The issue has come to my attention as I ran out of inodes on one of the
>>> nodes
>>> when replicating the weather data set. Apparently one node in our cluster
>>> has a 2 TB drive and is limited to 132,816,896 inodes. My naive partitioning
>>> method
>>>
>>
>> Do you mean 2GB?
>>
>
> Let me clarify:
> Most nodes have a 3 TB drive with a limit of 182,591,488 inodes.
> I found one node had a drive replaced. On that node we have a 2 TB drive
> with a limit of 132,816,896 inodes. The weather data had caused the drive to
> exceed its roughly 130 million inode limit.
>
>
>>
>> Vinayak
>>
>>
>>> for benchmarking has replicated the weather data five times and that
>>> exceeds the number of inodes available.
>>>
>>> In researching the issue, we ran the following command to count the
>>> number
>>> of files:
>>>     time find . -type f | wc -l
>>> Here are the results:
>>>    ** I am still waiting after about 4 hours; will update when it's finished **
>>>
>>> It seems we have a huge performance hit for my current configuration of
>>> weather data. The average size is probably 32 KB. The XML documents are
>>> from
>>> querying a web service provided by NOAA. Each file holds a month's
>>> records
>>> of sensor data.
>>>
>>> The concern is how the query time is affected by the act of opening and
>>> closing so many files.
>>>
>>> Options:
>>> 1. Treat this as the parameters we defined for our test.
>>> 2. Change the amount of data returned by each web service query. Example:
>>> query for a year's worth of data, thus reducing the number of files by a
>>> factor of 12.
>>> 3. Create a way to store multiple XML documents appended together in a
>>> single file, thus reducing the number of times a file must be opened and
>>> closed.
>>>
>>>
>>
>

Re: VXQuery File Size

Posted by Eldon Carman <ec...@ucr.edu>.
On Thu, Jan 16, 2014 at 8:41 PM, Vinayak Borkar <vi...@gmail.com> wrote:

> On 1/16/14, 3:41 PM, Eldon Carman wrote:
>
>> Question:
>> What is our target file size? VXQuery has been designed to work on many
>> small files, but what is a small file? Are we talking 64 MB or 64 KB?
>>
>
> The restriction is on the size of objects (or documents). In VXQuery, each
> document has to fit in a frame under the current implementation, and since
> one XML file contains one XML document, this translates into a limit on file
> sizes. I think we should do Option 3 from your mail below and support files
> that have multiple documents concatenated and stored in the same file. (This
> should be fine since the collection function returns a collection of items.)
>
>
Ok, let's discuss this more. Some of the new rewrite rules for pushing the child
steps into the data source scan may also help with processing larger files,
while keeping our frame size relatively small.


>
>
>> Background:
>> The issue has come to my attention as I ran out of inodes on one of the nodes
>> when replicating the weather data set. Apparently one node in our cluster
>> has a 2 TB drive and is limited to 132,816,896 inodes. My naive partitioning method
>>
>
> Do you mean 2GB?
>

Let me clarify:
Most nodes have a 3 TB drive with a limit of 182,591,488 inodes.
I found one node had a drive replaced. On that node we have a 2 TB drive with
a limit of 132,816,896 inodes. The weather data had caused the drive to
exceed its roughly 130 million inode limit.


>
> Vinayak
>
>
>> for benchmarking has replicated the weather data five times and that
>> exceeds the number of inodes available.
>>
>> In researching the issue, we ran the following command to count the number
>> of files:
>>     time find . -type f | wc -l
>> Here are the results:
>>    ** I am still waiting after about 4 hours; will update when it's finished **
>>
>> It seems we have a huge performance hit for my current configuration of
>> weather data. The average size is probably 32 KB. The XML documents are
>> from
>> querying a web service provided by NOAA. Each file holds a month's records
>> of sensor data.
>>
>> The concern is how the query time is affected by the act of opening and
>> closing so many files.
>>
>> Options:
>> 1. Treat this as the parameters we defined for our test.
>> 2. Change the amount of data returned by each web service query. Example:
>> query for a year's worth of data, thus reducing the number of files by a
>> factor of 12.
>> 3. Create a way to store multiple XML documents appended together in a
>> single file, thus reducing the number of times a file must be opened and
>> closed.
>>
>>
>

Re: VXQuery File Size

Posted by Vinayak Borkar <vi...@gmail.com>.
On 1/16/14, 3:41 PM, Eldon Carman wrote:
> Question:
> What is our target file size? VXQuery has been designed to work on many
> small files, but what is a small file? Are we talking 64 MB or 64 KB?

The restriction is on the size of objects (or documents). In VXQuery,
each document has to fit in a frame under the current implementation,
and since one XML file contains one XML document, this translates into
a limit on file sizes. I think we should do Option 3 from your mail below
and support files that have multiple documents concatenated and stored in
the same file. (This should be fine since the collection function returns
a collection of items.)
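To make the constraint concrete, here is a toy illustration of a per-document frame-size check; the frame size below is an arbitrary stand-in, not VXQuery's actual value:

```python
FRAME_SIZE = 64 * 1024  # hypothetical frame size in bytes, for illustration only

def fits_in_frame(doc: bytes, frame_size: int = FRAME_SIZE) -> bool:
    """A serialized document can be processed only if it fits in one frame."""
    return len(doc) <= frame_size

small = b"<observation/>" * 1_000   # ~14 KB: well within one frame
large = b"x" * (FRAME_SIZE + 1)     # just over one frame: rejected
print(fits_in_frame(small), fits_in_frame(large))  # True False
```

With many small documents concatenated in one file, each document stays under the frame limit even though the file itself can grow arbitrarily large.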


>
> Background:
> The issue has come to my attention as I ran out of inodes on one of the nodes
> when replicating the weather data set. Apparently one node in our cluster
> has a 2 TB drive and is limited to 132,816,896 inodes. My naive partitioning method

Do you mean 2GB?


Vinayak

> for benchmarking has replicated the weather data five times and that
> exceeds the number of inodes available.
>
> In researching the issue, we ran the following command to count the number
> of files:
>     time find . -type f | wc -l
> Here are the results:
>    ** I am still waiting after about 4 hours; will update when it's finished **
>
> It seems we have a huge performance hit for my current configuration of
> weather data. The average size is probably 32 KB. The XML documents are from
> querying a web service provided by NOAA. Each file holds a month's records
> of sensor data.
>
> The concern is how the query time is affected by the act of opening and
> closing so many files.
>
> Options:
> 1. Treat this as the parameters we defined for our test.
> 2. Change the amount of data returned by each web service query. Example:
> query for a year's worth of data, thus reducing the number of files by a
> factor of 12.
> 3. Create a way to store multiple XML documents appended together in a
> single file, thus reducing the number of times a file must be opened and
> closed.
>