5. High-level visual chirality
While the analysis of chiralities that arise in image processing
has useful implications in forensics, we are also interested
in understanding what kinds of high-level visual content
(objects, object regions, etc.) reveal visual chirality, and
whether we can discover these cues automatically. As de-
scribed in Section 4, if we try to train a network from scratch,
it invariably starts to pick up on uninterpretable, low-level
image signals. Instead, we hypothesize that if we start with
a ResNet network that has been pre-trained on ImageNet ob-
ject classification, then it will have a familiarity with objects
that will allow it to avoid picking up on low-level cues. Note
that such ImageNet-trained networks should not have features
specifically sensitive to chirality—indeed, as noted
above, many ImageNet classifiers are trained using random
horizontal flips as a form of data augmentation.
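To make the training signal concrete, the following PyTorch sketch shows one way to pose the chirality classification task of Section 3: each image is randomly mirrored, and the network is trained to predict whether the flip was applied. The ResNet-50 depth, the two-way head, and the FlipLabelDataset wrapper are illustrative assumptions, not necessarily the exact configuration used here.

```python
import random

import torch.nn as nn
from torch.utils.data import Dataset
from torchvision import models
from torchvision.transforms import functional as TF


class FlipLabelDataset(Dataset):
    """Wraps a dataset of PIL images; mirrors each sample with probability
    0.5 and returns the flip decision as the label to predict."""

    def __init__(self, base_dataset, transform):
        self.base = base_dataset      # assumed to yield PIL images
        self.transform = transform    # e.g. resize/crop + ToTensor

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img = self.base[idx]
        label = random.randint(0, 1)  # 1 = horizontally flipped
        if label == 1:
            img = TF.hflip(img)
        return self.transform(img), label


# ImageNet-pretrained ResNet backbone with a two-way chirality head.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
criterion = nn.CrossEntropyLoss()
```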
Data.
What distribution of images do we use for training?
We could try to sample from the space of all natural images.
However, because we speculate that many chirality cues have
to do with people, and with manmade objects and scenes, we
start with images that feature people. In particular, we utilize
the StreetStyle dataset of Matzen et al. [17], which consists
of millions of images of people gathered from Instagram.
For our work, we select a random subset of 700K images
from StreetStyle, and refer to this as the Instagram dataset;
example images are shown in Figures 1 and 5. We randomly
sample 5K images as a test set S_test, and split the remaining
images into training and validation sets with a ratio of 9:1
(unless otherwise stated, we use this same train/val/test split
strategy for all experiments in this paper).
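For concreteness, this splitting procedure could be implemented as below; the function name, seed handling, and list-based bookkeeping are illustrative rather than the exact implementation used in our experiments.

```python
import random


def split_instagram(paths, test_size=5000, val_ratio=0.1, seed=0):
    """Randomly hold out a test set, then split the remainder 9:1
    into training and validation sets."""
    rng = random.Random(seed)
    paths = paths[:]                 # avoid mutating the caller's list
    rng.shuffle(paths)
    test = paths[:test_size]
    rest = paths[test_size:]
    n_val = int(len(rest) * val_ratio)
    return rest[n_val:], rest[:n_val], test   # train, val, test
```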
Training.
We trained the chirality classification approach
described in Section 3 on Instagram, starting from an
ImageNet-pretrained model. As it turns out, the transfor-
mations applied to images before feeding them to a network
are crucial to consider. Initially, we downsampled all input
images bilinearly to a resolution of 512×512. A network so
trained achieves 92% accuracy on the Instagram test set,
a surprising result given that determining whether an image
has been flipped can be difficult even for humans.
As discussed above, it turns out that our networks were
still picking up on traces left by low-level processing, such
as boundary artifacts produced by JPEG encoding, as evi-
denced by CAM heatmaps that often fired near the corners of
images. In addition to pre-training on ImageNet, we found
that networks can be made more resistant to the most obvi-
ous such artifacts by performing random cropping of input
images. In particular, we randomly crop a 512×512 window
from the input images during training and testing (rather than
simply resizing the images). A network trained in this way
still achieves a test accuracy of 80%, a surprisingly high result.
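The two preprocessing variants compared in Table 1 correspond, roughly, to the following torchvision pipelines; the bilinear-resize variant is the one that appears to leak low-level artifacts, while the random-crop variant is applied identically at training and test time. The interpolation mode and the standard ImageNet normalization constants shown here are assumptions, and the exact settings may differ.

```python
from torchvision import transforms

imagenet_norm = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

# Variant 1: bilinear resize of the whole image to 512x512.
resize_512 = transforms.Compose([
    transforms.Resize((512, 512),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    imagenet_norm,
])

# Variant 2: random 512x512 crop, used for both training and testing,
# which suppresses corner/boundary artifacts from low-level processing.
random_crop_512 = transforms.Compose([
    transforms.RandomCrop(512, pad_if_needed=True),
    transforms.ToTensor(),
    imagenet_norm,
])
```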
Training set         Preprocessing   Test Accuracy
                                     Instagram   F100M
Instagram            Resizing        0.92        0.57
Instagram            RandCrop        0.80        0.59
Instagram (no-text)  RandCrop        0.74        0.55

Table 1. Chirality classification performance of models trained on Instagram. Hyper-parameters were selected by cross-validation. The first column indicates the training dataset, and the second column the preprocessing applied to input images. The last two columns report accuracy on a held-out Instagram test set and on an unseen dataset (Flickr100M, or F100M for short). Note that the same preprocessing scheme (resize vs. random crop) is applied to both the training and test sets, and the model trained on Instagram without text is also tested on Instagram without text.

Non-text cues.
Examining the most confident classifications, we found that many involved text (e.g., on clothing or
in the background), and that CAM heatmaps often predomi-
nantly focused on text regions. Indeed, text is such a strong
signal for chirality that it seems to drown out other signals.
This yields a useful insight: we may be able to leverage
chirality to learn a text detector via self-supervision, for any
language (so long as the writing is chiral, which is true for
many if not all languages).
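The CAM heatmaps referred to throughout this section follow the standard class activation mapping construction: the final convolutional feature maps are weighted by the classifier weights of the class of interest and upsampled to image resolution. A minimal sketch for a torchvision ResNet follows (the layer names match torchvision's ResNet; the normalization and upsampling details are our own choices):

```python
import torch
import torch.nn.functional as F


def class_activation_map(model, x, target_class):
    """CAM heatmap for a torchvision ResNet: weight the final convolutional
    feature maps by the fc weights of the target class, then upsample."""
    model.eval()
    with torch.no_grad():
        feats = model.conv1(x)
        feats = model.maxpool(model.relu(model.bn1(feats)))
        for layer in (model.layer1, model.layer2, model.layer3, model.layer4):
            feats = layer(feats)                      # (1, C, h, w)
        weights = model.fc.weight[target_class]       # (C,)
        cam = (weights[:, None, None] * feats[0]).sum(dim=0)
        cam = F.relu(cam)
        cam = cam / (cam.max() + 1e-8)                # normalize to [0, 1]
    return F.interpolate(cam[None, None], size=x.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]
```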
However, for the purpose of the current analysis, we wish
to discover non-text chiral cues as well. To make it easier
to identify such cues, we ran an automatic text detector [25]
on Instagram, split it into text and no-text subsets, and then
randomly sampled the no-text subset to form new training
and test sets. On the no-text subset, chirality classification
accuracy drops from 80% to 74%—lower, but still well
above chance.
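This filtering step can be sketched as follows; because we do not want to pin down the interface of the detector [25], detect_text_boxes below is a hypothetical placeholder for whichever detector is plugged in.

```python
def split_by_text(image_paths, detect_text_boxes):
    """Partition images into text / no-text subsets using an off-the-shelf
    text detector. `detect_text_boxes(path)` is a placeholder that should
    return the list of detected text boxes for an image."""
    with_text, without_text = [], []
    for path in image_paths:
        boxes = detect_text_boxes(path)
        (with_text if boxes else without_text).append(path)
    return with_text, without_text

# The no-text subset is then resampled into new train/val/test splits
# and used to retrain the chirality classifier.
```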
Generalization.
One concern is that our classifier learns features specific
to Instagram images. To test this, Table 1 (last column)
shows the evaluation accuracy of all models (without fine-
tuning) on another dataset of Internet photos, a randomly
selected subset of photos from Flickr100M [22]. Note that
there is a significant domain gap between Instagram and
Flickr100M, in that images in our Instagram dataset all
contain people, whereas Flickr100M features more general
content (landscapes, macro shots, etc.) in addition to people.
While the performance on Flickr100M is naturally lower
than on Instagram, our Instagram-trained models still per-
form above chance rates, with an accuracy of 55% (or 59% if
text is considered), suggesting that our learned chiral features
can generalize to new distributions of photos.
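This generalization test amounts to running the frozen Instagram-trained classifier on Flickr100M images with the same preprocessing and measuring flip-prediction accuracy; a sketch is given below, where the device handling and data loading details are illustrative.

```python
import torch


@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Accuracy of a frozen chirality classifier on a held-out dataset;
    `loader` yields (image, flip_label) batches as during training."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total
```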
5.1. Revealing object-level chiral features
Inspecting the CAM heatmaps derived from our non-text-
trained Instagram model reveals a network that focuses on
a coherent set of local regions, such as smart phones and
shirt pockets, across different photos. To further understand