Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/09/21 22:13:03 UTC

[GitHub] sandeep-krishnamurthy closed pull request #12621: Add docstring in im2rec.py

URL: https://github.com/apache/incubator-mxnet/pull/12621

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/docs/architecture/note_data_loading.md b/docs/architecture/note_data_loading.md
index 7c423bb6407..7c92c86a9c6 100644
--- a/docs/architecture/note_data_loading.md
+++ b/docs/architecture/note_data_loading.md
@@ -102,6 +102,7 @@ then compress into JPEG format.
 After that, we save a header that indicates the index and label
 for that image to be used when constructing the *Data* field for that record.
 We then pack several images together into a file.
+You may want to also review the [example using im2rec.py to create a RecordIO dataset](https://mxnet.incubator.apache.org/tutorials/basic/data.html#loading-data-using-image-iterators).
 
 ### Access Arbitrary Parts Of Data
 
diff --git a/docs/faq/recordio.md b/docs/faq/recordio.md
index f61571882bd..3091052ef6f 100644
--- a/docs/faq/recordio.md
+++ b/docs/faq/recordio.md
@@ -6,7 +6,13 @@ RecordIO implements a file format for a sequence of records. We recommend storin
 * Packing data together allows continuous reading on the disk.
 * RecordIO has a simple way to partition, simplifying distributed setting. We provide an example later.
 
-We provide the [im2rec tool](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc) so you can create an Image RecordIO dataset by yourself. The following walkthrough shows you how. Note that there is python version of [im2rec tool](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) and [example](https://mxnet.incubator.apache.org/tutorials/basic/data.html) using real-world data.
+We provide two tools for creating a RecordIO dataset.
+
+* [im2rec.cc](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc) - implements the tool using the C++ API.
+* [im2rec.py](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) - implements the tool using the Python API.
+
+Both provide the same output: a RecordIO dataset.
+You may also want to review the [example using real-world data with im2rec.py](https://mxnet.incubator.apache.org/tutorials/basic/data.html#loading-data-using-image-iterators).
 
 ### Prerequisites
 
@@ -14,7 +20,7 @@ Download the data. You don't need to resize the images manually. You can use ```
 
 ### Step 1. Make an Image List File
 
-* Note that the im2rec.py provide a param `--list` to generate the list for you but im2rec.cc doesn't support it.
+* Note that im2rec.py provides a `--list` parameter to generate the list for you, but im2rec.cc doesn't support it.
 
 After you download the data, you need to make an image list file.  The format is:
 
@@ -39,7 +45,7 @@ This is an example file:
 
 ### Step 2. Create the Binary File
 
-To generate a binary image, use `im2rec` in the tool folder. `im2rec` takes the path of the `_image list file_` you generated, the `_root path_` of the images, and the `_output file path_` as input. This process usually takes several hours, so be patient.
+To generate a binary image, use `im2rec` in the tool folder. `im2rec` takes the path of the `image list file` you generated, the `root path` of the images, and the `output file path` as input. This process usually takes several hours, so be patient.
 
 Sample command:
 
diff --git a/docs/tutorials/basic/data.md b/docs/tutorials/basic/data.md
index b5d0884f749..4a682e83f9f 100644
--- a/docs/tutorials/basic/data.md
+++ b/docs/tutorials/basic/data.md
@@ -16,7 +16,8 @@ To complete this tutorial, we need:
 $ pip install opencv-python requests matplotlib jupyter
 ```
 
-## MXNet Data Iterator  
+## MXNet Data Iterator
+
 Data Iterators in *MXNet* are similar to Python iterator objects.
 In Python, the function `iter` allows fetching items sequentially by calling  `next()` on
  iterable objects such as a Python `list`.
@@ -312,10 +313,10 @@ print(mx.recordio.unpack_img(s))
 ```
 
 #### Using tools/im2rec.py
-You can also convert raw images into *RecordIO* format using the ``im2rec.py`` utility script that is provided in the MXNet [src/tools](https://github.com/dmlc/mxnet/tree/master/tools) folder.
+You can also convert raw images into *RecordIO* format using the [__im2rec.py__](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) utility script that is provided in the MXNet [src/tools](https://github.com/dmlc/mxnet/tree/master/tools) folder.
 An example of how to use the script for converting to *RecordIO* format is shown in the `Image IO` section below.
 
-* Note that there is a C++ version of [im2rec](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc), please refer to [here](https://mxnet.incubator.apache.org/faq/recordio.html) for more information.
+* Note that there is a C++ API implementation of [im2rec](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc); please refer to the [RecordIO FAQ](https://mxnet.incubator.apache.org/faq/recordio.html) for more information.
 
 ## Image IO
 
diff --git a/tools/im2rec.py b/tools/im2rec.py
index ef3e3f3cf74..da3a1dddc87 100755
--- a/tools/im2rec.py
+++ b/tools/im2rec.py
@@ -36,6 +36,18 @@
     multiprocessing = None
 
 def list_image(root, recursive, exts):
+    """Traverses the root directory that contains images and
+    generates an image list iterator.
+    Parameters
+    ----------
+    root: string
+    recursive: bool
+    exts: string
+    Returns
+    -------
+    image iterator that contains all the images under the specified path
+    """
+
     i = 0
     if recursive:
         cat = {}
@@ -61,6 +73,15 @@ def list_image(root, recursive, exts):
                 i += 1
 
 def write_list(path_out, image_list):
+    """Helper function to write image list into the file.
+    The format is as below,
+    integer_image_index \t float_label_index \t path_to_image
+    Note that the blank between number and tab is only used for readability.
+    Parameters
+    ----------
+    path_out: string
+    image_list: list
+    """
     with open(path_out, 'w') as fout:
         for i, item in enumerate(image_list):
             line = '%d\t' % item[0]
@@ -70,6 +91,11 @@ def write_list(path_out, image_list):
             fout.write(line)
 
 def make_list(args):
+    """Generates .lst file.
+    Parameters
+    ----------
+    args: object that contains all the arguments
+    """
     image_list = list_image(args.root, args.recursive, args.exts)
     image_list = list(image_list)
     if args.shuffle is True:
@@ -95,6 +121,14 @@ def make_list(args):
             write_list(args.prefix + str_chunk + '_train.lst', chunk[sep_test:sep_test + sep])
 
 def read_list(path_in):
+    """Reads the .lst file and generates corresponding iterator.
+    Parameters
+    ----------
+    path_in: string
+    Returns
+    -------
+    item iterator that contains information in .lst file
+    """
     with open(path_in) as fin:
         while True:
             line = fin.readline()
@@ -102,17 +136,26 @@ def read_list(path_in):
                 break
             line = [i.strip() for i in line.strip().split('\t')]
             line_len = len(line)
+            # check the data format of .lst file
             if line_len < 3:
-                print('lst should at least has three parts, but only has %s parts for %s' %(line_len, line))
+                print('lst should have at least three parts, but only has %s parts for %s' % (line_len, line))
                 continue
             try:
                 item = [int(line[0])] + [line[-1]] + [float(i) for i in line[1:-1]]
             except Exception as e:
-                print('Parsing lst met error for %s, detail: %s' %(line, e))
+                print('Parsing lst met error for %s, detail: %s' % (line, e))
                 continue
             yield item
 
 def image_encode(args, i, item, q_out):
+    """Reads, preprocesses, and packs the image, then puts it in the output queue.
+    Parameters
+    ----------
+    args: object
+    i: int
+    item: list
+    q_out: queue
+    """
     fullpath = os.path.join(args.root, item[1])
 
     if len(item) > 3 and args.pack_label:
@@ -145,10 +188,10 @@ def image_encode(args, i, item, q_out):
         return
     if args.center_crop:
         if img.shape[0] > img.shape[1]:
-            margin = (img.shape[0] - img.shape[1]) // 2;
+            margin = (img.shape[0] - img.shape[1]) // 2
             img = img[margin:margin + img.shape[1], :]
         else:
-            margin = (img.shape[1] - img.shape[0]) // 2;
+            margin = (img.shape[1] - img.shape[0]) // 2
             img = img[:, margin:margin + img.shape[0]]
     if args.resize:
         if img.shape[0] > img.shape[1]:
@@ -167,6 +210,14 @@ def image_encode(args, i, item, q_out):
         return
 
 def read_worker(args, q_in, q_out):
+    """Function that will be spawned to fetch the image
+    from the input queue and put it back to output queue.
+    Parameters
+    ----------
+    args: object
+    q_in: queue
+    q_out: queue
+    """
     while True:
         deq = q_in.get()
         if deq is None:
@@ -175,6 +226,14 @@ def read_worker(args, q_in, q_out):
         image_encode(args, i, item, q_out)
 
 def write_worker(q_out, fname, working_dir):
+    """Function that will be spawned to fetch processed image
+    from the output queue and write to the .rec file.
+    Parameters
+    ----------
+    q_out: queue
+    fname: string
+    working_dir: string
+    """
     pre_time = time.time()
     count = 0
     fname = os.path.basename(fname)
@@ -204,6 +263,11 @@ def write_worker(q_out, fname, working_dir):
             count += 1
 
 def parse_args():
+    """Defines all arguments.
+    Returns
+    -------
+    args object that contains all the params
+    """
     parser = argparse.ArgumentParser(
         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
         description='Create an image list or \
@@ -260,8 +324,10 @@ def parse_args():
 
 if __name__ == '__main__':
     args = parse_args()
+    # if the '--list' is used, it generates .lst file
     if args.list:
         make_list(args)
+    # otherwise read .lst file to generate .rec file
     else:
         if os.path.isdir(args.prefix):
             working_dir = args.prefix
@@ -279,13 +345,16 @@ def parse_args():
                 if args.num_thread > 1 and multiprocessing is not None:
                     q_in = [multiprocessing.Queue(1024) for i in range(args.num_thread)]
                     q_out = multiprocessing.Queue(1024)
+                    # define the process
                     read_process = [multiprocessing.Process(target=read_worker, args=(args, q_in[i], q_out)) \
                                     for i in range(args.num_thread)]
+                    # process images with num_thread processes
                     for p in read_process:
                         p.start()
+                    # only use one process to write .rec to avoid race condition
                     write_process = multiprocessing.Process(target=write_worker, args=(q_out, fname, working_dir))
                     write_process.start()
-
+                    # put the image list into input queue
                     for i, item in enumerate(image_list):
                         q_in[i % len(q_in)].put((i, item))
                     for q in q_in:
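
For readers of the diff above, the .lst line format described in the write_list docstring (integer index, one or more float labels, then the image path, separated by tabs) can be illustrated with a small round trip. This is only a sketch, not code from the PR: format_lst_line and parse_lst_line are hypothetical helper names, the '%f' label formatting is an assumption because the diff elides that part of write_list, and the sample path is made up. The parsing expression mirrors the one visible in read_list.

```python
def format_lst_line(item):
    """Build one tab-separated .lst line from [index, path, label, ...].

    Hypothetical helper; the '%f' label formatting is an assumption.
    """
    index, path, labels = item[0], item[1], item[2:]
    fields = ['%d' % index] + ['%f' % label for label in labels] + [path]
    return '\t'.join(fields) + '\n'


def parse_lst_line(line):
    """Parse one .lst line into [index, path, label, ...].

    Mirrors the expression shown in read_list; returns None when the
    line has fewer than three tab-separated parts.
    """
    parts = [p.strip() for p in line.strip().split('\t')]
    if len(parts) < 3:
        return None
    return [int(parts[0])] + [parts[-1]] + [float(p) for p in parts[1:-1]]


# Round trip with a made-up entry: index 0, label 1.0.
item = [0, 'cat/img_0001.jpg', 1.0]
line = format_lst_line(item)
parsed = parse_lst_line(line)
```

Note that the parsed item keeps the path in position 1 and the labels after it, which matches how image_encode reads item[1] as the path in the diff.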


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services