Posted to commits@mxnet.apache.org by sk...@apache.org on 2018/09/21 22:13:13 UTC
[incubator-mxnet] branch master updated: Add docstring in im2rec.py (#12621)
This is an automated email from the ASF dual-hosted git repository.
skm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git
The following commit(s) were added to refs/heads/master by this push:
new 504d24c Add docstring in im2rec.py (#12621)
504d24c is described below
commit 504d24c596c6e1e9d471afea0540d1c54fe9198c
Author: Jake Lee <gs...@gmail.com>
AuthorDate: Fri Sep 21 15:13:01 2018 -0700
Add docstring in im2rec.py (#12621)
* address feedback from Aaron
* link the example in https://mxnet.incubator.apache.org/architecture/note_data_loading.html
* update the link to im2py section and fix the text format
* add one blank to pass the markdown linter
* add docstring for im2rec.py
* fix the wording
* add parameter and return type
* add one missing return details
* fix the wording
---
docs/architecture/note_data_loading.md | 1 +
docs/faq/recordio.md | 12 ++++--
docs/tutorials/basic/data.md | 7 +--
tools/im2rec.py | 79 +++++++++++++++++++++++++++++++---
4 files changed, 88 insertions(+), 11 deletions(-)
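
The `.lst` format documented by the new `write_list` and `read_list` docstrings can be tried in isolation. The following is a minimal sketch, not code from this commit: the helper names `format_lst_line` and `parse_lst_line` are hypothetical, and formatting labels with `%f` is an assumption. The parsing step mirrors the expression in the `read_list` hunk of the diff, which reorders each line into index, path, then labels.

```python
def format_lst_line(index, labels, path):
    """Build one tab-separated .lst line: index, label(s), image path.

    Hypothetical helper; the '%f' label format is an assumption.
    """
    fields = [str(int(index))] + ['%f' % float(label) for label in labels] + [path]
    return '\t'.join(fields) + '\n'


def parse_lst_line(line):
    """Parse one .lst line into [index, path, label, ...].

    Returns None for a malformed line (fewer than three fields,
    or fields that do not convert to numbers).
    """
    parts = [p.strip() for p in line.strip().split('\t')]
    # A valid line needs at least an index, one label, and a path.
    if len(parts) < 3:
        return None
    try:
        return [int(parts[0])] + [parts[-1]] + [float(p) for p in parts[1:-1]]
    except ValueError:
        return None


line = format_lst_line(0, [1.0], 'cat/001.jpg')
item = parse_lst_line(line)
print(repr(line))  # '0\t1.000000\tcat/001.jpg\n'
print(item)        # [0, 'cat/001.jpg', 1.0]
```

Returning `None` for a bad line is a choice made for this sketch; the script in the diff prints a message and skips the line instead.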
diff --git a/docs/architecture/note_data_loading.md b/docs/architecture/note_data_loading.md
index 7c423bb..7c92c86 100644
--- a/docs/architecture/note_data_loading.md
+++ b/docs/architecture/note_data_loading.md
@@ -102,6 +102,7 @@ then compress into JPEG format.
After that, we save a header that indicates the index and label
for that image to be used when constructing the *Data* field for that record.
We then pack several images together into a file.
+You may want to also review the [example using im2rec.py to create a RecordIO dataset](https://mxnet.incubator.apache.org/tutorials/basic/data.html#loading-data-using-image-iterators).
### Access Arbitrary Parts Of Data
diff --git a/docs/faq/recordio.md b/docs/faq/recordio.md
index f615718..3091052 100644
--- a/docs/faq/recordio.md
+++ b/docs/faq/recordio.md
@@ -6,7 +6,13 @@ RecordIO implements a file format for a sequence of records. We recommend storin
* Packing data together allows continuous reading on the disk.
* RecordIO has a simple way to partition, simplifying distributed setting. We provide an example later.
-We provide the [im2rec tool](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc) so you can create an Image RecordIO dataset by yourself. The following walkthrough shows you how. Note that there is python version of [im2rec tool](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) and [example](https://mxnet.incubator.apache.org/tutorials/basic/data.html) using real-world data.
+We provide two tools for creating a RecordIO dataset.
+
+* [im2rec.cc](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc) - implements the tool using the C++ API.
+* [im2rec.py](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) - implements the tool using the Python API.
+
+Both provide the same output: a RecordIO dataset.
+You may want to also review the [example using real-world data with im2rec.py](https://mxnet.incubator.apache.org/tutorials/basic/data.html#loading-data-using-image-iterators).
### Prerequisites
@@ -14,7 +20,7 @@ Download the data. You don't need to resize the images manually. You can use ```
### Step 1. Make an Image List File
-* Note that the im2rec.py provide a param `--list` to generate the list for you but im2rec.cc doesn't support it.
+* Note that the im2rec.py provides a param `--list` to generate the list for you, but im2rec.cc doesn't support it.
After you download the data, you need to make an image list file. The format is:
@@ -39,7 +45,7 @@ This is an example file:
### Step 2. Create the Binary File
-To generate a binary image, use `im2rec` in the tool folder. `im2rec` takes the path of the `_image list file_` you generated, the `_root path_` of the images, and the `_output file path_` as input. This process usually takes several hours, so be patient.
+To generate a binary image, use `im2rec` in the tool folder. `im2rec` takes the path of the `image list file` you generated, the `root path` of the images, and the `output file path` as input. This process usually takes several hours, so be patient.
Sample command:
diff --git a/docs/tutorials/basic/data.md b/docs/tutorials/basic/data.md
index b5d0884..4a682e8 100644
--- a/docs/tutorials/basic/data.md
+++ b/docs/tutorials/basic/data.md
@@ -16,7 +16,8 @@ To complete this tutorial, we need:
$ pip install opencv-python requests matplotlib jupyter
```
-## MXNet Data Iterator
+## MXNet Data Iterator
+
Data Iterators in *MXNet* are similar to Python iterator objects.
In Python, the function `iter` allows fetching items sequentially by calling `next()` on
iterable objects such as a Python `list`.
@@ -312,10 +313,10 @@ print(mx.recordio.unpack_img(s))
```
#### Using tools/im2rec.py
-You can also convert raw images into *RecordIO* format using the ``im2rec.py`` utility script that is provided in the MXNet [src/tools](https://github.com/dmlc/mxnet/tree/master/tools) folder.
+You can also convert raw images into *RecordIO* format using the [__im2rec.py__](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) utility script that is provided in the MXNet [src/tools](https://github.com/dmlc/mxnet/tree/master/tools) folder.
An example of how to use the script for converting to *RecordIO* format is shown in the `Image IO` section below.
-* Note that there is a C++ version of [im2rec](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc), please refer to [here](https://mxnet.incubator.apache.org/faq/recordio.html) for more information.
+* Note that there is a C++ API implementation of [im2rec](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc); please refer to the [RecordIO FAQ](https://mxnet.incubator.apache.org/faq/recordio.html) for more information.
## Image IO
diff --git a/tools/im2rec.py b/tools/im2rec.py
index ef3e3f3..da3a1dd 100755
--- a/tools/im2rec.py
+++ b/tools/im2rec.py
@@ -36,6 +36,18 @@ except ImportError:
multiprocessing = None
def list_image(root, recursive, exts):
+ """Traverses the root directory that contains images and
+ generates an image list iterator.
+ Parameters
+ ----------
+ root: string
+ recursive: bool
+ exts: string
+ Returns
+ -------
+ image iterator that contains all the images under the specified path
+ """
+
i = 0
if recursive:
cat = {}
@@ -61,6 +73,15 @@ def list_image(root, recursive, exts):
i += 1
def write_list(path_out, image_list):
+ """Helper function to write image list into the file.
+ The format is as below,
+ integer_image_index \t float_label_index \t path_to_image
+ Note that the blank between number and tab is only used for readability.
+ Parameters
+ ----------
+ path_out: string
+ image_list: list
+ """
with open(path_out, 'w') as fout:
for i, item in enumerate(image_list):
line = '%d\t' % item[0]
@@ -70,6 +91,11 @@ def write_list(path_out, image_list):
fout.write(line)
def make_list(args):
+ """Generates .lst file.
+ Parameters
+ ----------
+ args: object that contains all the arguments
+ """
image_list = list_image(args.root, args.recursive, args.exts)
image_list = list(image_list)
if args.shuffle is True:
@@ -95,6 +121,14 @@ def make_list(args):
write_list(args.prefix + str_chunk + '_train.lst', chunk[sep_test:sep_test + sep])
def read_list(path_in):
+ """Reads the .lst file and generates corresponding iterator.
+ Parameters
+ ----------
+ path_in: string
+ Returns
+ -------
+ item iterator that contains information in .lst file
+ """
with open(path_in) as fin:
while True:
line = fin.readline()
@@ -102,17 +136,26 @@ def read_list(path_in):
break
line = [i.strip() for i in line.strip().split('\t')]
line_len = len(line)
+ # check the data format of .lst file
if line_len < 3:
- print('lst should at least has three parts, but only has %s parts for %s' %(line_len, line))
+ print('lst should have at least three parts, but only has %s parts for %s' % (line_len, line))
continue
try:
item = [int(line[0])] + [line[-1]] + [float(i) for i in line[1:-1]]
except Exception as e:
- print('Parsing lst met error for %s, detail: %s' %(line, e))
+ print('Parsing lst met error for %s, detail: %s' % (line, e))
continue
yield item
def image_encode(args, i, item, q_out):
+ """Reads, preprocesses, and packs the image, then puts it in the output queue.
+ Parameters
+ ----------
+ args: object
+ i: int
+ item: list
+ q_out: queue
+ """
fullpath = os.path.join(args.root, item[1])
if len(item) > 3 and args.pack_label:
@@ -145,10 +188,10 @@ def image_encode(args, i, item, q_out):
return
if args.center_crop:
if img.shape[0] > img.shape[1]:
- margin = (img.shape[0] - img.shape[1]) // 2;
+ margin = (img.shape[0] - img.shape[1]) // 2
img = img[margin:margin + img.shape[1], :]
else:
- margin = (img.shape[1] - img.shape[0]) // 2;
+ margin = (img.shape[1] - img.shape[0]) // 2
img = img[:, margin:margin + img.shape[0]]
if args.resize:
if img.shape[0] > img.shape[1]:
@@ -167,6 +210,14 @@ def image_encode(args, i, item, q_out):
return
def read_worker(args, q_in, q_out):
+ """Function that will be spawned to fetch the image
+ from the input queue and put it in the output queue.
+ Parameters
+ ----------
+ args: object
+ q_in: queue
+ q_out: queue
+ """
while True:
deq = q_in.get()
if deq is None:
@@ -175,6 +226,14 @@ def read_worker(args, q_in, q_out):
image_encode(args, i, item, q_out)
def write_worker(q_out, fname, working_dir):
+ """Function that will be spawned to fetch processed image
+ from the output queue and write to the .rec file.
+ Parameters
+ ----------
+ q_out: queue
+ fname: string
+ working_dir: string
+ """
pre_time = time.time()
count = 0
fname = os.path.basename(fname)
@@ -204,6 +263,11 @@ def write_worker(q_out, fname, working_dir):
count += 1
def parse_args():
+ """Defines all arguments.
+ Returns
+ -------
+ args object that contains all the params
+ """
parser = argparse.ArgumentParser(
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
description='Create an image list or \
@@ -260,8 +324,10 @@ def parse_args():
if __name__ == '__main__':
args = parse_args()
+ # if the '--list' is used, it generates .lst file
if args.list:
make_list(args)
+ # otherwise read the .lst file to generate the .rec file
else:
if os.path.isdir(args.prefix):
working_dir = args.prefix
@@ -279,13 +345,16 @@ if __name__ == '__main__':
if args.num_thread > 1 and multiprocessing is not None:
q_in = [multiprocessing.Queue(1024) for i in range(args.num_thread)]
q_out = multiprocessing.Queue(1024)
+ # define the process
read_process = [multiprocessing.Process(target=read_worker, args=(args, q_in[i], q_out)) \
for i in range(args.num_thread)]
+ # process images with num_thread processes
for p in read_process:
p.start()
+ # only use one process to write .rec to avoid a race condition
write_process = multiprocessing.Process(target=write_worker, args=(q_out, fname, working_dir))
write_process.start()
-
+ # put the image list into input queue
for i, item in enumerate(image_list):
q_in[i % len(q_in)].put((i, item))
for q in q_in: