First Attempt

Monday August 20, 2018 at 11:29 am CDT

The past couple of weeks have been spent sorting data and building out the network. The following script was used to extract frames from the webm files I recorded; part of it was already shared in a previous post.

#!/bin/bash

for filename in $(ls webmfiles/*.webm | sort -n -t _ -k 3); do
  # pull the zero-padded recording number out of the filename
  num=$(echo "$filename" | grep -o '[0-9]\+')

  img_file="./training_set/train_img_${num}_%03d.jpeg"

  # 105 frames per recording at 21 fps, each resized to 42x32
  ffmpeg -i "$filename" -r 21 -s 42x32 -frames:v 105 -f image2 "$img_file"
done

This iterates through all of the recordings in the webmfiles/ directory and extracts 105 frames from each at a rate of 21 frames per second. Each extracted image is 42 x 32 pixels. Each image is placed in a directory called training_set/ and named according to the pattern train_img_[webm_file_number]_[image_number]. Note that all of the webm files end in _[webm_file_number], where webm_file_number is 3 digits long (zero padding is added for numbers less than 100).
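
As a quick sanity check (my own addition, not part of the original pipeline), the extracted frames can be grouped by recording number right after extraction to confirm that every recording produced exactly 105 frames. This only relies on the naming pattern described above:

import os
from collections import Counter

# Group the extracted frames by recording number (the first number in each name)
# and make sure every recording produced exactly 105 frames.
counts = Counter(f.split("_")[2] for f in os.listdir("./training_set")
                 if f.endswith(".jpeg"))

for vid_num, count in sorted(counts.items()):
    assert count == 105, "recording %s produced %d frames" % (vid_num, count)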

After this, the frames have to be separated into two categories: features and labels (the targets). I used the script below to move each frame into the appropriate directory:

#!/bin/bash

# ggrep is GNU grep (installed via Homebrew on macOS); on Linux, plain grep -oP works
for filename in $(find . -maxdepth 2 -name "*.jpeg"); do
  # pull both numbers (recording number and frame number) out of the filename
  nums=$(echo $filename | ggrep -oP '[0-9]+')

  vid_num=$(echo $nums | ggrep -oP '[0-9]+\s')
  zero_pad_img_num=$(echo $nums | ggrep -oP '\s[0-9]+')

  # strip the zero padding so the modulo test below works on a plain integer
  img_num=$(echo $zero_pad_img_num | sed 's/^0*//')

  file_basename=$(basename $filename)

  # every third frame becomes a label; the frames in between become features
  if (($img_num % 3 == 0)); then
    mv "./training_set/$file_basename" "./training_set/labels/$file_basename"
  else
    mv "./training_set/$file_basename" "./training_set/features/$file_basename"
  fi
done

The script iterates through all of the extracted images. If the image number is divisible by three, the image is moved to the labels/ directory; all other images go into the features/ directory. Ultimately, the images in the features directory will be concatenated in consecutive pairs along the channel axis, so each training example has shape 32 x 42 x 6 (rows x columns x channels). Each individual image starts out as 32 x 42 x 3, where the 3 is the number of color channels.
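
To make the scheme concrete, here is a small illustration (not part of the pipeline) of which frame numbers end up where, and how the feature frames pair up with their labels:

# Illustration: frame roles within a single 105-frame recording
frames = range(1, 106)

labels = [n for n in frames if n % 3 == 0]     # 3, 6, 9, ..., 105 -> 35 label frames
features = [n for n in frames if n % 3 != 0]   # 1, 2, 4, 5, ...   -> 70 feature frames

# Consecutive feature frames pair up with the label frame that follows them:
# (1, 2) -> 3, (4, 5) -> 6, ..., (103, 104) -> 105
pairs = list(zip(features[0::2], features[1::2], labels))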

The data is loaded and transformed like so:

import os
import cv2
import numpy as np

features_path = "./training_set/features/"
labels_path = "./training_set/labels/"

def load_data(img_dir):
    # read every jpeg in the directory, in sorted filename order, into a single array
    return np.array([cv2.imread(os.path.join(img_dir, img))
                     for img in sorted(os.listdir(img_dir))
                     if img.endswith(".jpeg")])

feature_imgs = load_data(features_path)  # (7280, 32, 42, 3)
label_imgs = load_data(labels_path)      # (3640, 32, 42, 3)

def concat_frames(samples):
    # split the feature images into consecutive pairs of two frames each
    num_samples = samples.shape[0]
    paired_samples = np.array(np.split(samples, num_samples // 2))

    concatenated_frames = []

    # stack each pair along the channel axis: two (32, 42, 3) frames -> one (32, 42, 6) sample
    for pair in paired_samples:
        concatenated_frames.append(np.concatenate((pair[0], pair[1]), axis=2))

    return np.array(concatenated_frames)

paired_frames = concat_frames(feature_imgs)  # (3640, 32, 42, 6)
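
One thing worth verifying at this point (my own sanity check, not something from the original pipeline) is that the concatenated feature pairs and the label images still line up, since the pairing depends entirely on sorted filename order:

# Each 6-channel feature pair should correspond to exactly one label frame,
# so the sample counts have to match.
assert paired_frames.shape == (3640, 32, 42, 6)
assert label_imgs.shape == (3640, 32, 42, 3)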

The concatenated images are then run through two layers of convolution and max pooling. Each kernel is 5 x 5 pixels.

import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 32, 42, 6], name="X")

# placeholder for the label frames (dtype assumed; the loss below casts y to float32)
y = tf.placeholder(tf.uint8, [None, 32, 42, 3], name="y")

# Convolution Layer #1
# output is 32 channels, 32 rows, 42 cols
conv1 = tf.layers.conv2d(
          inputs=X,
          filters=32,
          kernel_size=[5, 5],
          padding="same",
          activation=tf.nn.relu)

# Pooling Layer #1
# output is 32 channels, 16 rows, 21 cols
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

# Convolution Layer #2
# output is 64 channels, 16 rows, 21 cols
conv2 = tf.layers.conv2d(
          inputs=pool1,
          filters=64,
          kernel_size=[5, 5],
          padding="same",
          activation=tf.nn.relu)

# Pooling Layer #2
# output is 64 channels, 8 rows, 11 cols = 5632
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2, padding="same")
pool2_flat = tf.reshape(pool2, [-1, 8 * 11 * 64])

The flattened output of the convolutional layers is fed into a four-layer, fully connected autoencoder with a 150-unit bottleneck in the middle and an output layer sized to reconstruct a full 32 x 42 x 3 frame.

d1_neurons = 300
d2_neurons = 150
d3_neurons = 300

num_outputs = 32 * 42 * 3

# He Initialization
he_init = tf.contrib.layers.variance_scaling_initializer()

# L2 Regularization
l2_reg = tf.contrib.layers.l2_regularizer(0.0001)

# Dense Layer #1
dense1 = tf.layers.dense(inputs=pool2_flat,
                         units=d1_neurons,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_reg,
                         activation=tf.nn.relu)

# Dense Layer #2
dense2 = tf.layers.dense(inputs=dense1,
                         units=d2_neurons,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_reg,
                         activation=tf.nn.relu)

# Dense Layer #3
dense3 = tf.layers.dense(inputs=dense2,
                         units=d3_neurons,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_reg,
                         activation=tf.nn.relu)

# Dense Layer #4
dense4 = tf.layers.dense(inputs=dense3,
                         units=num_outputs,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_reg,
                         activation=None)


# MSE between the network output and the flattened label frame (32 * 42 * 3 = 4032 values)
reconstruction_loss = tf.reduce_mean(tf.square(dense4 - tf.cast(tf.reshape(y, [-1, 4032]), tf.float32)))

# add the L2 regularization terms to the reconstruction loss
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)

learning_rate = 0.001  # Adam learning rate (illustrative value)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
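
The training loop itself isn't shown here, but the graph above can be driven with a standard TensorFlow 1.x session. A minimal sketch (the batch size and epoch count are arbitrary choices, and shuffling is omitted for brevity) might look like this:

# Minimal training-loop sketch; batch size and epoch count are arbitrary
n_epochs = 20
batch_size = 64

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for epoch in range(n_epochs):
        for start in range(0, len(paired_frames), batch_size):
            X_batch = paired_frames[start:start + batch_size]
            y_batch = label_imgs[start:start + batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        # report the reconstruction loss on the last batch as a rough progress signal
        batch_loss = sess.run(reconstruction_loss, feed_dict={X: X_batch, y: y_batch})
        print(epoch, batch_loss)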

The results after training the network are rather underwhelming. This upcoming week will be spent making improvements to the network and displaying the output of the autoencoder. Hopefully, seeing the output will give me some insight into the sorts of improvements that need to be made.
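
As for displaying the output of the autoencoder, the flat 4032-value prediction just needs to be reshaped back into a 32 x 42 x 3 image. A quick sketch using matplotlib (assuming it runs while the training session above is still open, or after restoring a saved model) could look like this:

import matplotlib.pyplot as plt

# Pull one prediction out of the network and turn it back into an image.
prediction = sess.run(dense4, feed_dict={X: paired_frames[:1]})  # shape (1, 4032)
predicted_frame = np.clip(prediction.reshape(32, 42, 3), 0, 255).astype(np.uint8)

plt.imshow(predicted_frame[:, :, ::-1])  # cv2 loads frames as BGR, so flip to RGB for display
plt.axis("off")
plt.show()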


Photo by Calum MacAulay on Unsplash