Posted to dev@sis.apache.org by Martin Desruisseaux <ma...@geomatys.fr> on 2014/02/25 20:01:47 UTC

About ShapeFile reader

Hello all

I would like to propose the following modifications to the Shapefile reader:

  * Rename as ShapefileStore for consistency with NetcdfStore.
  * Declare ShapefileStore as a subclass of DataStore:
      o Implement the getMetadata() method using the information
        currently stored in fields like version, xmin, xmax, etc.
  * Make the currently public fields private.
  * Add a getEnvelope(Query) method which returns the (xmin, ymin, etc.)
    values.
      o The Query argument would be an empty class for now, but still
        defined as a placeholder for future developments.
  * Add a getFeatures(Query) method which returns a FeatureCollection
    (extends Collection<Feature>).


To explain more about the last point: the intent is to be able to read 
large sets of Features without loading all of them in memory. So instead 
of storing all Features in a HashMap, we would allow implementations 
to return a collection backed by an iterator which instantiates the 
Features on the fly. Such an iterator follows the same idea as 
java.sql.ResultSet.
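
As an illustration of the idea (the class names below are hypothetical, 
just a sketch of the pattern, where ShapefileReader stands for some 
assumed low-level record reader):

    import java.io.IOException;
    import java.util.AbstractCollection;
    import java.util.Iterator;

    // Hypothetical sketch: a collection which decodes each Feature on
    // demand, in the spirit of java.sql.ResultSet.
    public class LazyFeatureCollection extends AbstractCollection<Feature>
            implements AutoCloseable
    {
        private final ShapefileReader reader;   // assumed low-level reader

        LazyFeatureCollection(ShapefileReader reader) {
            this.reader = reader;
        }

        @Override
        public Iterator<Feature> iterator() {
            return new Iterator<Feature>() {
                @Override public boolean hasNext() {
                    return reader.hasMoreRecords();
                }
                @Override public Feature next() {
                    return reader.readNextFeature();  // decoded on the fly
                }
                @Override public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }

        @Override
        public int size() {
            return reader.recordCount();   // known from the file header
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }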

Using a method instead of direct access to a public field has two purposes:

  * It allows specifying a filter or other query aspects.
  * It allows returning an AutoCloseable collection for implementations
    that perform their I/O operations on the fly.


So instead of iterating on features like below:

    for (Feature f : shapefile.FeatureMap.values()) {
         // ... do some stuff ...
    }


We would do:

    try (FeatureCollection features = shapefile.getFeatures(myQuery)) {
         for (Feature f : features) {
             // ... do some stuff ...
         }
    }
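
Putting the pieces together, the proposed API surface could look like the 
sketch below (names as proposed above; packages and exception 
declarations are left open, and bodies are elided):

    // Sketch of the proposed API surface; implementations elided.
    public class ShapefileStore extends DataStore {
        @Override
        public Metadata getMetadata() {
            // Built from the fields which are currently public:
            // version, xmin, xmax, etc.
            throw new UnsupportedOperationException("sketch only");
        }

        public Envelope getEnvelope(Query query) {
            // Returns the (xmin, ymin, ...) values.
            throw new UnsupportedOperationException("sketch only");
        }

        public FeatureCollection getFeatures(Query query) {
            // Returns a collection backed by an on-the-fly iterator.
            throw new UnsupportedOperationException("sketch only");
        }
    }

    /** Empty for now; a placeholder for future filters, paging, etc. */
    public class Query {
    }

    /** A Collection of Features which may hold I/O resources. */
    public interface FeatureCollection
            extends Collection<Feature>, AutoCloseable {
    }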


What do people think?

     Martin


Re: About ShapeFile reader

Posted by Martin Desruisseaux <ma...@geomatys.fr>.
Hello Travis

Thanks for your feedback!

On 26/02/14 13:43, Travis L Pinney wrote:
> Could you go into more detail on how the filter
> works? Would the filter be able to use indexes opportunistically if
> they existed?
Yes, the plan is to allow the use of an index when one exists. In 
addition to Shapefile indexes, there are also PostGIS indexes to leverage.

Leveraging an index is possible on the assumption that the DataStore 
implementations know about the index details. For example, a 
ShapefileStore would know that the index is provided in a ".shx" file, 
while a PostgisStore would know how to write SQL statements that use the 
indexes. Since the getFeatures(Query) method would be defined in 
FeatureStore, the store implementation would hopefully have enough 
information for leveraging the indexes.
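
In other words (a hypothetical sketch, with the index-specific logic 
reduced to comments):

    // Each store keeps its index knowledge to itself, inside its own
    // getFeatures(Query) implementation.
    public abstract class FeatureStore extends DataStore {
        public abstract FeatureCollection getFeatures(Query query);
    }

    public class ShapefileStore extends FeatureStore {
        @Override
        public FeatureCollection getFeatures(Query query) {
            // Consult the ".shx" index to seek directly to the records.
            throw new UnsupportedOperationException("sketch only");
        }
    }

    public class PostgisStore extends FeatureStore {
        @Override
        public FeatureCollection getFeatures(Query query) {
            // Translate the query to SQL so PostGIS applies its indexes.
            throw new UnsupportedOperationException("sketch only");
        }
    }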

For the API, I would propose to leverage the new JDK8 Stream API. For 
the trunk, which is still on JDK6, we could not use the JDK8 classes, 
but we could provide some custom methods as close to JDK8 as we can.
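
For example, a JDK6-compatible substitute could look like the sketch 
below (hypothetical names, mirroring the JDK8 method as closely as the 
language allows):

    // A pre-JDK8 stand-in for java.util.function.Predicate.
    public interface Filter<T> {
        boolean accept(T element);
    }

    // The FeatureCollection interface proposed earlier, here gaining
    // a filter method in the style of JDK8 Stream.filter(...).
    public interface FeatureCollection extends Collection<Feature> {
        /** Returns a view containing only the accepted Features. */
        FeatureCollection filter(Filter<? super Feature> filter);
    }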

The Stream interface [1] has a "filter" function, which could be used 
like below (using lambda expressions):

    Stream<Feature> features = datastore.getFeatures(null).stream();
    features = features.filter(f -> f.boundedBy().intersect(mySearchArea));

The above would work out-of-the-box using the default Stream 
implementation provided by JDK8. However, a ShapefileStore or 
PostgisStore would provide its own Stream implementation and override 
the filter method in order to leverage indexing. Since the stream is 
created by Collection.stream() (a new method in JDK8) and the Collection 
is itself created by the FeatureStore, the store has indirect control 
over the Stream implementation, and consequently over the 'filter' 
method implementation.
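
A simplified sketch of that hook (class name hypothetical; a real 
implementation would also have to delegate every other Stream method, 
which is omitted here):

    import java.util.AbstractCollection;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    // The Collection returned by getFeatures(...) decides what its
    // stream() method returns, so the store can substitute a Stream
    // whose filter(...) recognizes spatial predicates and answers them
    // from the index instead of testing every Feature.
    public abstract class ShapefileFeatureCollection
            extends AbstractCollection<Feature>
    {
        @Override
        public Stream<Feature> stream() {
            // A full implementation would return a Stream decorator
            // overriding filter(...); this sketch keeps the default.
            return StreamSupport.stream(spliterator(), false);
        }
        // iterator() and size() as in the earlier sketch.
    }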

The advantage of using the JDK8 Stream API is that it is designed for 
parallelization. We are entering a new world here...

> In the case of a shapefile, it will have a ".shx" file which would
> allow you to jump to the first record in a slice. For example, if you
> wanted to read only the records 10000 through 10010 out of the
> shapefile, you could read the ".shx" file to find the byte offset of
> record 10000. Only the minimum amount of data is read from the file
> when done in this manner.
>
> Some file formats will not have record offsets like a shapefile. For
> those formats it would be possible to start at the beginning of the
> file and skip over records until it reaches the start of the slice.
Right. The first case would be a data store providing its own 
implementation of Stream, while the second case would be a data store 
relying on the default Stream implementation provided by JDK8.
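
For that second case, the slicing in your example (records 10000 to 
10010) could be written directly against the default stream, where 
process(...) is a placeholder for the caller's work:

    // Fallback slicing with the default JDK8 Stream: skip() must read
    // and discard records, since the format provides no record offsets.
    try (FeatureCollection features = datastore.getFeatures(null)) {
        features.stream()
                .skip(9999)     // record numbers are 1-based here
                .limit(11)      // records 10000 to 10010 inclusive
                .forEach(f -> process(f));
    }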


> There is missing functionality on sis-shapefile for some of the Shape
> types. Should I work on those features in the Shapefile branch?
You could also commit on the trunk, JDK6 or JDK7 branch, and we could 
close the Shapefile branch, as you wish. If working on the JDK7 branch 
is easy for you, it may be easier for synchronizing our work. But we 
would just need to make sure that we do not edit the shapefile classes 
at the same time...

     Martin


[1] http://download.java.net/jdk8/docs/api/java/util/stream/Stream.html


Re: About ShapeFile reader

Posted by Travis L Pinney <tr...@gmail.com>.
+1 on the changes. Could you go into more detail on how the filter
works? Would the filter be able to use indexes opportunistically if
they existed?

It would be useful to filter by slicing or pagination.

In the case of a shapefile, it will have a ".shx" file which would
allow you to jump to the first record in a slice. For example, if you
wanted to read only the records 10000 through 10010 out of the
shapefile, you could read the ".shx" file to find the byte offset of
record 10000. Only the minimum amount of data is read from the file
when done in this manner.
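
For example (a sketch of that lookup, assuming the fixed ".shx" layout 
from the ESRI specification: a 100-byte header followed by one 8-byte 
entry per record, holding a big-endian offset and content length 
expressed in 16-bit words):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Reads the ".shx" entry of a record and returns the byte offset
    // of that record in the ".shp" file.
    static long byteOffsetOfRecord(RandomAccessFile shx, int recordNumber)
            throws IOException
    {
        shx.seek(100 + 8L * (recordNumber - 1));  // records are 1-based
        int offsetInWords = shx.readInt();        // readInt() is big-endian
        return offsetInWords * 2L;                // 16-bit words to bytes
    }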

Some file formats will not have record offsets like a shapefile. For
those formats it would be possible to start at the beginning of the
file and skip over records until it reaches the start of the slice.

There is missing functionality on sis-shapefile for some of the Shape
types. Should I work on those features in the Shapefile branch?

Thanks,
Travis


