<h1>A Better Way to Redis GeoHash</h1>
<p><img src="/resources/post_4/reventis.png" alt="reventis" /></p>
<p>Redis, an in-memory data store and cache, is a popular database for many
applications. One of its many unique features is the <code class="language-plaintext highlighter-rouge">geohash</code> set of
commands. These commands allow you to add locations to a sorted set,
find the distance between any two locations, or even get a list of
all locations that fall within a given radius of a particular location.
A typical usage might run as follows:
<!--more--></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> geoadd myplaces -72.677169 41.761841 hartford-ct
(integer) 1
127.0.0.1:6379> geoadd myplaces -72.620889 41.764183 easthartford-ct
(integer) 1
127.0.0.1:6379> geoadd myplaces -72.667689 41.702897 wethersfield-ct
(integer) 1
127.0.0.1:6379> geoadd myplaces -72.746997 41.749971 westhartford-ct
(integer) 1
127.0.0.1:6379> geoadd myplaces -72.339749 41.564189 colchester-ct
(integer) 1
127.0.0.1:6379> geoadd myplaces -72.234068 41.782345 mansfield-ct
(integer) 1
</code></pre></div></div>
<p>Then, to get all locations within 5 miles of hartford-ct, just do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> georadiusbymember myplaces hartford-ct 5 mi
1) "westhartford-ct"
2) "wethersfield-ct"
3) "hartford-ct"
4) "easthartford-ct"
</code></pre></div></div>
<p>To get the distance in miles between colchester-ct and mansfield-ct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> geodist myplaces colchester-ct mansfield-ct mi
"16.0341"
</code></pre></div></div>
<p>To look up individual locations’ coordinates:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> geopos myplaces colchester-ct mansfield-ct
1) 1) "-72.33974844217300415"
2) "41.56418901188302328"
2) 1) "-72.23406940698623657"
2) "41.78234485790384412"
</code></pre></div></div>
<p>Indeed, the Redis geohash can be an extremely useful tool for web applications
offering location-based services.</p>
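<p>The same commands can be driven from application code. Here is a minimal sketch using the Python <code class="language-plaintext highlighter-rouge">redis</code> client; it goes through the generic <code class="language-plaintext highlighter-rouge">execute_command</code> call so the exact helper-method signatures of any particular client version do not matter, and the host, port, and key name are only placeholders:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import redis

# Connect to a local Redis instance (host/port are placeholders).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# GEOADD key longitude latitude member
r.execute_command("GEOADD", "myplaces", -72.677169, 41.761841, "hartford-ct")
r.execute_command("GEOADD", "myplaces", -72.620889, 41.764183, "easthartford-ct")

# All members within 5 miles of hartford-ct.
print(r.execute_command("GEORADIUSBYMEMBER", "myplaces", "hartford-ct", 5, "mi"))

# Distance between two members, in miles.
print(r.execute_command("GEODIST", "myplaces", "hartford-ct", "easthartford-ct", "mi"))
</code></pre></div></div>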
<p>However, that’s about all it can do. It lacks many other features
of a full spatio-temporal or object-tracking index that would
no doubt be highly useful.</p>
<h2 id="reventis">Reventis</h2>
<p><a href="https://github.com/starkdg/reventis">Reventis</a> - a portmanteau of the
words, Redis and Events - is a Redis Module that introduces a native
data structure capable of indexing events or point locations in time
and space. Events can be inserted into the data structure
by its geo-spatial coordinates along with beginning and ending timestamps.
Then all events within a given geographical area and timespan can be queried.
The sequence mirrors the geohash commands. Here is an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> reventis.insert myplaces -72.569023 41.839269 06-01-2020 10:00 06-01-2020 12:00 "southwindsor-ct"
(integer) 942729159590019073
127.0.0.1:6379> reventis.insert myplaces -72.534727 41.770793 06-01-2020 11:00 06-01-2020 11:35 "manchester-ct"
(integer) 942950844413706242
127.0.0.1:6379> reventis.insert myplaces -72.565348 41.907065 06-01-2020 11:15 06-01-2020 11:20 "eastwindsor-ct"
(integer) 943188820988133379
127.0.0.1:6379> reventis.insert myplaces -72.310697 41.958829 06-01-2020 10:30 06-01-2020 11:05 "stafford-ct"
(integer) 943534029388972036
127.0.0.1:6379> reventis.insert myplaces -72.480640 41.806255 06-02-2020 10:00 06-02-2020 11:00 "buckley-manchester"
(integer) 943880954251706373
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">insert</code> command is followed by a key string, the longitude, latitude, beginning date-time,
end-date-time, and finally, a short descriptive string of the event. The command returns an integer
identifier for that new event.</p>
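<p>The same insert can be issued from application code. A minimal sketch with the Python <code class="language-plaintext highlighter-rouge">redis</code> client, assuming the Reventis module is already loaded into the server; the arguments follow the order shown in the session above, with the date and the time passed as separate tokens just as redis-cli does:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import redis

r = redis.Redis(host="localhost", port=6379)

# reventis.insert key longitude latitude start-date start-time end-date end-time description
event_id = r.execute_command(
    "reventis.insert", "myplaces",
    -72.569023, 41.839269,
    "06-01-2020", "10:00", "06-01-2020", "12:00",
    "southwindsor-ct")
print(event_id)   # the integer identifier returned for the new event
</code></pre></div></div>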
<p>The <code class="language-plaintext highlighter-rouge">queryradius</code> command retrieves all the events inserted within a 10 mile radius of South
Windsor, Ct for June 1, 2020.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> reventis.queryradius myplaces -72.594856 41.814437 10 mi 06-01-2020 8:00 06-01-2020 16:00
1) 1) "manchester-ct"
2) (integer) 942950844413706242
3) "-72.534727000000004"
4) "41.770792999999998"
5) "06-01-2020 11:00:00\x00"
6) "06-01-2020 11:35:00\x00"
2) 1) "southwindsor-ct"
2) (integer) 942729159590019073
3) "-72.569023000000001"
4) "41.839269000000002"
5) "06-01-2020 10:00:00\x00"
6) "06-01-2020 12:00:00\x00"
3) 1) "eastwindsor-ct"
2) (integer) 943188820988133379
3) "-72.565348"
4) "41.907065000000003"
5) "06-01-2020 11:15:00\x00"
6) "06-01-2020 11:20:00\x00"
</code></pre></div></div>
<h3 id="what-if-you-want-to-retrieve-only-certain-kinds-of-events">What if you want to retrieve only certain kinds of events?</h3>
<p>With Reventis, in order to further filter events, you can assign
integer categories 1 to 64 to each event. Multiple categories per event are possible.
Simply invoke the <code class="language-plaintext highlighter-rouge">addcategory</code> command with the key, the assigned
id number for that event, followed by a list of categories you wish to assign.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>reventis.addcategory myplaces 942729159590019073 10 20 30
reventis.addcategory myplaces 942950844413706242 10 20
reventis.addcategory myplaces 943188820988133379 55 60
</code></pre></div></div>
<p>Then you can query in the same way as above with the desired categories
appended onto the end of the command. The following queries the myplaces
key for categories 10 and 20.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> reventis.querybyradius myplaces -72.594856 41.814437 10 mi 06-01-2020 8:00 06-01-2020 16:00 10 20
1) 1) "manchester-ct"
2) (integer) 942950844413706242
3) "-72.534727000000004"
4) "41.770792999999998"
5) "06-01-2020 11:00:00\x00"
6) "06-01-2020 11:35:00\x00"
2) 1) "southwindsor-ct"
2) (integer) 942729159590019073
3) "-72.569023000000001"
4) "41.839269000000002"
5) "06-01-2020 10:00:00\x00"
6) "06-01-2020 12:00:00\x00"
</code></pre></div></div>
<h2 id="object-tracing">Object Tracing</h2>
<p>By attaching an integer object identifier to an event, Reventis provides the ability to
track objects. With the <code class="language-plaintext highlighter-rouge">update</code> command, a new event is inserted into the data
structure under a given object ID. In this way, a chain of events can
be tracked by a common object identifier.
The update command is followed by a key string, longitude, latitude, timestamp,
an object id, and finally, a descriptive string for the update. For example,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> reventis.update mytracks -72.514046 41.823773 06-01-2020 8:00 100 "avery st., SW, CT"
(integer) 2692244187721629697
127.0.0.1:6379> reventis.update mytracks -72.679270 41.754280 06-01-2020 9:00 100 "seymour st. Htfd, CT"
(integer) 2692649354707730434
127.0.0.1:6379> reventis.update mytracks -72.673778 41.773129 06-01-2020 10:30 100 "Rensselaers, Htfd, CT"
(integer) 2692919245505495043
127.0.0.1:6379> reventis.update mytracks -72.672338 41.762721 06-01-2020 11:15 100 "Uconn Htfd, CT"
(integer) 2693312935547633668
127.0.0.1:6379> reventis.update mytracks -72.760321 41.762886 06-01-2020 13:00 100 "Front St, Htfd, CT"
(integer) 2693757132817039365
127.0.0.1:6379> reventis.update mytracks -72.554229 41.825927 06-01-2020 17:00 100 "Wapping, SW, CT"
(integer) 2694272864704200710
127.0.0.1:6379> reventis.update mytracks -72.604337 41.695416 06-01-2020 9:30 200 "Sherman, Rd, Glastonbury, CT"
(integer) 2694635424033406983
127.0.0.1:6379> reventis.update mytracks -72.689379 41.748207 06-01-2020 11:00 200 "Trinity College, Htfd.,CT"
(integer) 2694958883051143176
127.0.0.1:6379> reventis.update mytracks -72.668197 41.762560 06-01-2020 12:00 200 "Convention Center, Htfd.,CT"
(integer) 2695270279792492553
127.0.0.1:6379> reventis.update mytracks -72.671490 41.762741 06-01-2020 14:30 200 "Bears Smokehouse BBQ, Htfd, CT"
(integer) 2695768111295037450
127.0.0.1:6379> reventis.update mytracks -72.463723 41.638888 06-01-2020 19:00 200 "Denler, Dr - Marlborough, CT"
(integer) 2696103043806396427
127.0.0.1:6379> reventis.update mytracks -72.568387 41.685582 06-02-2020 11:00 200 "Glastonbury, CT"
(integer) 2696432004334157836
</code></pre></div></div>
<p>Now we can query for all updates that fall within a 5 mile radius of Hartford, Ct.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> reventis.queryobjradius mytracks -72.675968 41.763890 5 mi 06-01-2020 06:00 06-01-2020 23:00
1) 1) "Front St, Htfd, CT"
2) (integer) 2693757132817039365
3) (integer) 100
4) "-72.760321000000005"
5) "41.762886000000002"
6) "06-01-2020 13:00:00\x00"
2) 1) "seymour st. Htfd, CT"
2) (integer) 2692649354707730434
3) (integer) 100
4) "-72.679270000000002"
5) "41.754280000000001"
6) "06-01-2020 09:00:00\x00"
3) 1) "Convention Center, Htfd.,CT"
2) (integer) 2695270279792492553
3) (integer) 200
4) "-72.668197000000006"
5) "41.762560000000001"
6) "06-01-2020 12:00:00\x00"
4) 1) "Rensselaers, Htfd, CT"
2) (integer) 2692919245505495043
3) (integer) 100
4) "-72.673777999999999"
5) "41.773128999999997"
6) "06-01-2020 10:30:00\x00"
5) 1) "Uconn Htfd, CT"
2) (integer) 2693312935547633668
3) (integer) 100
4) "-72.672337999999996"
5) "41.762720999999999"
6) "06-01-2020 11:15:00\x00"
6) 1) "Trinity College, Htfd.,CT"
2) (integer) 2694958883051143176
3) (integer) 200
4) "-72.689379000000002"
5) "41.748207000000001"
6) "06-01-2020 11:00:00\x00"
7) 1) "Bears Smokehouse BBQ, Htfd, CT"
2) (integer) 2695768111295037450
3) (integer) 200
4) "-72.671490000000006"
5) "41.762740999999998"
6) "06-01-2020 14:30:00\x00"
</code></pre></div></div>
<p>Or get a list of object identifiers that intersect a particular area and time span:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1:6379> reventis.trackallradius mytracks -72.682303 41.765600 5 mi 06-01-2020 5:00 06-01-2020 23:30
1) (integer) 100
2) (integer) 200
</code></pre></div></div>
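<p>Because every update carries its object id, a reply like the one from <code class="language-plaintext highlighter-rouge">queryobjradius</code> above can be regrouped into per-object, time-ordered tracks on the client side. A rough sketch; the reply layout (description, event id, object id, longitude, latitude, timestamp) is simply read off the output shown above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict
from datetime import datetime
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

reply = r.execute_command(
    "reventis.queryobjradius", "mytracks",
    -72.675968, 41.763890, 5, "mi",
    "06-01-2020", "06:00", "06-01-2020", "23:00")

tracks = defaultdict(list)
for descr, event_id, obj_id, lon, lat, stamp in reply:
    # Timestamps come back with a trailing NUL byte, as seen above.
    t = datetime.strptime(stamp.rstrip("\x00"), "%m-%d-%Y %H:%M:%S")
    tracks[obj_id].append((t, float(lon), float(lat), descr))

# Sort each object's updates by time to reconstruct its path.
for obj_id in sorted(tracks):
    path = sorted(tracks[obj_id])
    print(obj_id, [p[3] for p in path])
</code></pre></div></div>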
<p>There is also the <code class="language-plaintext highlighter-rouge">reventis.hist</code> commmand which allows you to get the history of any object - either the
entire history or only for a designated time duration.</p>
<h2 id="but-how-well-does-it-scale">But How Well Does it Scale?</h2>
<p>The all-important question!</p>
<p>Since Reventis stores each event/object in a balanced search tree, query response
time is kept within logarithmic bounds. In the following graph, four different
query sizes are plotted over increasing index size. The x-axis is N, the total number
of indexed events; the y-axis is response time in milliseconds. As you can see, most queries
stabilize to under 1ms with increasing N. Query #4 does not stabilize because its
query region is literally the size of the state of Texas and its time span covers more
than a year. Also, the test inserts uniformly distributed random events, so larger query
regions contain proportionally more results. This is rarely the case in practice,
since real data is usually far more clustered.</p>
<p><img src="/resources/post_4/results.png" alt="QueryResults" /></p>
<h2 id="summary">Summary</h2>
<p>Reventis introduces a robust set of commands for the management of spatio-temporal point
data, and offers an efficient solution for applications that need to manage
location- and time-based data.</p>
<p>More documentation on the various commands is available <a href="https://github.com/starkdg/reventis">here</a>.
A client-side library of functions is also available to automate interactions with Redis.</p>
<p>There is also an interesting application of Reventis to the GDELT - Global Database of Events, Language,
and Tone - dataset, described in the project README. I’ll leave further exploration for a future
blog post.</p>
<h1>ClipSeekr™: Video Clip Recognition System</h1>
<p>ClipSeekr is a real-time video clip recognition system<br />
designed to detect video sequences that occur in a video<br />
stream.
<!--more--></p>
<h2 id="how-it-works">How It Works</h2>
<p>ClipSeekr works by indexing fingerprints of video clips.<br />
A 64-bit fingerprint is created for each frame of the clip<br />
from the spatial frequency information extracted from its<br />
discrete cosine transform. These 64-bit integers are then<br />
stored in a reverse index. This reverse index is simply a<br />
Redis database of key-value pairs, where the key is a frame’s<br />
fingerprint pointing to a value consisting of an ID and some<br />
sequence information. Unknown streams can then be monitored<br />
to recognize the appearance of these indexed clips. The<br />
basic principle is simple. When the number of consecutive<br />
frames recognized for a particular ID reaches a specified<br />
threshold, the clip can then be identified together with its<br />
timestamp in the stream. This threshold is adjustable, but<br />
a good value for a 29.97 fps stream seems to be between 5<br />
and 10 consecutive frames.</p>
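<p>In sketch form, the recognition loop is little more than a dictionary lookup plus a run-length counter. The Python below is only an illustration of the idea - the fingerprint extraction and the Redis-backed index are abstracted away, and the names are mine; the real implementation lives in the repository linked below:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def detect_clips(frame_hashes, index, threshold=8):
    """Scan a stream of 64-bit frame fingerprints against a reverse index.

    index maps fingerprint -> (clip_id, position_in_clip).
    A clip is reported once `threshold` consecutive frames hit the same id.
    """
    current_id, run = None, 0
    for t, h in enumerate(frame_hashes):
        hit = index.get(h)
        if hit is not None and (current_id is None or hit[0] == current_id):
            current_id, run = hit[0], run + 1
        else:
            current_id, run = (hit[0], 1) if hit is not None else (None, 0)
        if run == threshold:
            # Report the clip id and the frame position where it was recognized.
            yield current_id, t

# index = {fingerprint: (clip_id, seq_no), ...}  built offline from known clips
# for clip_id, frame_no in detect_clips(stream_hashes, index): ...
</code></pre></div></div>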
<h2 id="code">Code</h2>
<p>The code can be found in the github repository here:</p>
<p><a href="https://github.com/starkdg/clipseekr">ClipSeekr</a></p>
<h2 id="test-results">Test Results</h2>
<p>To evaluate this method, we streamed four hours of television<br />
and copied the commercial spots into new files for indexing.<br />
Altogether, there were 142 of these ad spots, 135 of which<br />
were unique video sequences. In brief, only one spot failed<br />
to be detected outright - i.e. a “false negative” - while five<br />
were detected falsely - i.e. “false positives”. The rest were<br />
successfully detected within seconds of their occurrence in the stream.<br />
That makes for a false positive rate of roughly 3.3%, and a<br />
false negative rate of roughly 0.7% (one miss out of 142 spots). The following table logs the<br />
results more precisely. The first two columns mark the clips<br />
and the timestamps for where they actually occur in the stream.<br />
The next two columns indicate the clips that get recognized<br />
along with their timestamps.</p>
<p>A black font represents correct detections; a red font<br />
represents false positives; and blue is for false negatives.</p>
<div class="viewport" id="includedContent"></div>
<p>The only one that failed to be detected was a McDonald’s<br />
commercial, called “Uber Eats”. The only thing noteworthy<br />
is that the frames seemed exceptionally dark in contrast.<br />
Perhaps not enough definition in the fingerprints. Another<br />
noteworthy issue is the second detection of the spot called<br />
“Jack Daniels”. While the first one was a correct match,<br />
the second detection was actually a different clip that<br />
shared enough frames in common with the first that it was<br />
recognized as the first. This is an inherent weakness in the<br />
fingerprinting system, since there is not enough temporal<br />
information preserved to differentiate the two in real-time.</p>
<h2 id="a-few-notes-for-further-study">A few notes for further study:</h2>
<ul>
<li>
<p>While the fingerprinting method is fairly robust to many<br />
distortions, it is not robust to changes in the screen format.<br />
In other words, many broadcast streams manipulate the screen<br />
format to include varying amounts of black space in the margins.<br />
Also, the presence of various logos and other textual occlusions<br />
further obfuscates the spatial information of the frames.<br />
Alternative fingerprinting methods can be explored for this:<br />
scale-invariant feature points, or feature points combined with<br />
region-based descriptors.</p>
</li>
<li>
<p>The limited temporal information restricts the ability of the<br />
system to differentiate between clips that share a significant<br />
portion of frames in common. In other words, two commercial<br />
spots are often composed from common sequences only edited<br />
differently. Unfortunately, the real-time nature of the problem<br />
prohibits a second pass of the data. Recognition decisions are<br />
constrained to only looking at past frames.</p>
</li>
<li>
<p>Given the success of convolutional neural nets for image<br />
recognition tasks, it would be interesting to add in a<br />
recurrence property to better model a sequence of frames.<br />
Previous work in extracting image fingerprints from convolutional<br />
network models shows promise in differentiating images:</p>
</li>
</ul>
<p><a href="https://blog.phash.org/posts/concise-image-descriptor">pyConvnetPhash</a>.</p>
<p>However, this is an extremely slow approach when dealing with 30<br />
fps streams. Recurrent Neural Networks could possibly add in<br />
some temporal information, but might be limited to video clips<br />
of a fixed set length.</p>
<p>Thank you for your time in reading this post.<br />
Comments and suggestions are welcome.</p>
<p><a href="https://github.com/starkdg/clipseekr/issues">Comments and Suggestions</a></p>starkdgClipSeekr is a real-time video clip recognition system designed to detect video sequences that occur in a video stream.AudioScout™: Audio Fingerprint Retrieval System2019-04-30T00:00:00+00:002019-04-30T00:00:00+00:00https://starkdg.github.io/posts/audioscout<p>Audio track recognition is about identifying short audio clips among a larger collection of indexed tracks.
To get a better idea for what it is:
<!--more--></p>
<h2 id="what-it-can-do">What it can do:</h2>
<ul>
<li>
<p>You hear a song playing. It sounds familiar, but you can’t quite put your
finger on it. You reach for your cell phone and manage to record a few seconds.
Submitting the recording to a track recognition system can identify its source.</p>
</li>
<li>
<p>Or: You are constantly receiving new audio tracks to add to your ever expanding collection. The main
problem is duplicates are mixed in with these new arrivals. File names and metadata may provide an occasional
clue but are inconsistent. Even the raw data come in varying formats and levels of quality, so
a straight bitstream comparison is of no use, not to mention cumbersome. Even if you have virtually
unlimited storage, just storing duplicates is no way to bring order to your collection. You need track recognition.</p>
</li>
<li>
<p>An invaluable tool to monitor how often you encounter specific audio signals and collect relevant statistics.</p>
</li>
</ul>
<h2 id="what-it-cannot-do">What it cannot do:</h2>
<ul>
<li>
<p>Identify a song independently of a particular performance, or a story independently of a particular narration. For example,
different renditions of the same popular song will not match one another.</p>
</li>
<li>
<p>Learn to recognize spoken commands - like “tell me”, “define”, or “what is the weather?”. Conceivably,
you might think you can index reference tracks of such commands and expect queries to be correctly identified
as such. This will not work. Two different people saying a word amounts to two different tracks.</p>
</li>
<li>
<p>Recognize someone’s voice or a specific instrument. This is Voice Recognition, not track recognition.</p>
</li>
<li>
<p>Any kind of semantic analysis. This can be done with various machine learning techniques. Nevertheless, it is
still an entirely different problem.</p>
</li>
</ul>
<p>While track recognition can still be robust against various distortions, audio that you expect to be matched still
needs to be from the same source.</p>
<h2 id="a-reliable-solution">A Reliable Solution</h2>
<p>This post introduces the second iteration of AudioScout™, an audio track recognition system that, while
admittedly not built on the newest algorithm, is not without its advantages. Indeed, the basis of the
algorithm has been around for nearly two decades now. You can read more about it here <a href="https://pdfs.semanticscholar.org/4f92/768276a0823cffeb9435ccda67beaca1f542.pdf">haitsma, kalker, 2002</a>,
but I will give you my particular implementation of it. It is a surprisingly simple and elegant algorithm.</p>
<p>First off, some of the advantages:</p>
<ul>
<li>
<p>For one, it does not depend on a giant corpus of audio files to <em>train</em> the model. There are no machine learning
techniques involved. The algorithm is entirely deterministic and not data dependent, and it does not require constant
fine-tuning.</p>
</li>
<li>
<p>Relatively low storage overhead to index the collection. The audio fingerprints consume only 1.5% of the space of
audio files, so indexing audio content does not consume much additional space relative to the size of the collection.</p>
</li>
<li>
<p>Recall accuracy - the percentage of correct results for distorted queries - is at least competitive, if not ideal.
In my tests, I have seen near 95% accuracy for 4-6 second queries. Of course, this all depends on the magnitude of
the distortion, but these tests allowed for a severe level. Precision - which is eroded by false positives - is
quite favorable too. False matches are infrequent, and they can be dismissed
by thresholding the confidence score that is returned with all results.</p>
</li>
</ul>
<h2 id="audio-fingerprinting-method">Audio Fingerprinting Method</h2>
<p>The fingerprinting method is summarized in figure 1. Condensed fingerprints are extracted from the audio.
The signal is segmented into tightly overlapping frames, each of 0.40 second duration. Overlapping
increases the chance two sequences of hash frames can be matched. A short-time Fourier transform - STFT -
is applied to each frame, and 33 perceptually significant frequencies are selected. The 33 frequencies are
used to make a 32-bit binary hash, h, according to relative differences between adjacent frequencies:</p>
<p>$h[i] = 0$ if $freq[i] - freq[i+1] < 0$ <br />
$h[i] = 1$ if $freq[i] - freq[i+1] > 0$</p>
<p>for i = 1 … 32 and frequency magnitudes, $freq[j], j = [1 … 33]$</p>
<p><img src="/resources/post_2/figure1.png" alt="figure 1" /></p>
<p>Once the binary hashes - or binhashes as referred to in figure 1 - for a pair of audio tracks are calculated,
substring matching can be used to see if any of the binhash frames from one signal match up with the frames
from another by looking for sections where the bit error rate stays below a predetermined threshold. Bit error
rates can be ascertained by a simple normalized hamming distance - the percentage of bits that differ.</p>
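<p>In code, the bit error rate is a one-liner, and testing one candidate alignment is a short loop. A sketch (the threshold value here is purely illustrative, and the function names are mine):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bit_error_rate(h1, h2):
    """Fraction of differing bits between two 32-bit frame hashes."""
    return bin((h1 ^ h2) & 0xFFFFFFFF).count("1") / 32.0

def match_at(query_hashes, track_hashes, offset, ber_threshold=0.25):
    """Does the query sequence line up with the track at the given offset?"""
    window = track_hashes[offset:offset + len(query_hashes)]
    if len(window) != len(query_hashes):
        return False
    return all(bit_error_rate(q, t) <= ber_threshold
               for q, t in zip(query_hashes, window))
</code></pre></div></div>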
<p>But what makes this technique especially powerful and robust to distortion is that even more information can be gleaned from
the absolute differences between adjacent frequencies. By ordering these differences
from the smallest to the greatest in absolute value, and keeping track of
the original position indices, we then have a rough idea for which bit positions in the array are most likely to flip
through distortion. These position indices can then be saved in 32-bit integers by setting those positions to 1.
With one 32-bit integer for each binhash, we get a parallel array describing which bits are most likely to be toggled by
distortion.</p>
<p>While this does double the payload for each fingerprint, that payload is still only 3.0% of the original signal, since
the number of binhashes is only 1.5% of the audio signal. What is more, these
toggles only have to be computed for query signals of a few seconds in duration.</p>
<p>These toggle arrays can then be used to find extra candidates to compare to prospective matches in our substring
matching algorithm. For each binhash of a query signal, just try all the permutations for flipping the bits indicated
in the toggle. If just one candidate for a given frame falls below a given threshold,
it can be considered a match within the prospective sequence. This greatly increases the odds for finding a match.</p>
<p>Of course, this comes with a catch. As you increase the number of bit toggles to consider - let’s call it p = 0 to 32 -
the number of candidate hashes explodes exponentially. In other words, while p = 1 means $2^1=2$ extra candidates,
p=4 means $2^4=16$ extra candidates, and so on. So, it is imperative to keep a relatively low value for p to
avoid overburdening the matching algorithm. For our purposes, a value of p = 6 seems to mark a good upper limit.</p>
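<p>Generating the candidate set from a toggle word is a small bit-manipulation exercise: enumerate every subset of the marked positions and XOR it into the hash. A sketch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import combinations

def toggle_candidates(h, toggle_mask):
    """Yield every variant of h obtained by flipping the marked bits.

    toggle_mask has 1 bits at the p positions judged most likely to flip,
    so 2**p candidates are produced (the unmodified hash included).
    """
    positions = [i for i in range(32) if (toggle_mask >> i) & 1]
    for k in range(len(positions) + 1):
        for combo in combinations(positions, k):
            flipped = h
            for pos in combo:
                flipped ^= 1 << pos
            yield flipped
</code></pre></div></div>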
<h2 id="fingerprint-indexing">Fingerprint Indexing</h2>
<p>AudioScout™ stores the fingerprints of all tracks in a reverse index. A reverse index is basically just
a hash table, where the binhash is a key pointing to a data unit containing the unique track ID along with sequence information.
This sequence information tells us where the binhash fits into the track’s sequence.</p>
<p>In this way, unknown queries can be checked against the index. Each binhash of a query can be looked up in the
index. If a match is found, the match is stored in a list to track the number of finds for that unique ID. More candidate
binhashes are generated from the toggle information by permuting all the marked bits.</p>
<p>Retrieval performance is fast, taking no more than a second or two to identify 2 to 6 second queries. Obviously, the longer
the query, the longer the wait time.</p>
<p>The index can hold a virtually unlimited number of tracks.</p>
<h2 id="test-results">Test Results</h2>
<p>To test the AudioScout retrieval system, over 1200 music CD tracks were indexed. From these files, 15-second clips
were chosen randomly to use as query signals, applying a telephone-like distortion and adding a 0.05 amplitude noise
on top of the signal. This seems to be a fair if somewhat extreme level of distortion one might reasonably expect.</p>
<p>Here’s a plot of the distortion used:</p>
<p><img src="/resources/post_2/query-distortion.png" alt="distortion" /></p>
<p>The following tables show the classification accuracy for the test queries. The entries are the percentage of queries
that obtained a correct matching result. Queries are performed for various signal durations - across the columns - and for
various numbers of bit toggles - across the rows. There are three tables, one for each threshold (T = 0.025, 0.050, 0.075).</p>
<p><img src="/resources/post_2/table3.png" alt="table 1" /></p>
<p><img src="/resources/post_2/table2.png" alt="table 2" /></p>
<p><img src="/resources/post_2/table1.png" alt="table 3" /></p>
<p>As you can see, almost 95% accuracy was obtained for a 5-second query clip using p=5 toggles.</p>
<p>Undistorted queries - that is, queries from the indexed files that are not at all distorted - do return a 100% correct
match rate. So, the clearer the signal, the better the recall rate.</p>
<h2 id="code">Code</h2>
<p>Access to the index is controlled by a server program - Auscoutd. It exposes a network interface through which
client applications can submit new audio tracks as well as query unknown tracks. A demo client program is included,
called AudioScout. The github repository can be found here:</p>
<p><a href="https://github.com/starkdg/JAudioScout">JAudioScout</a></p>
<p>The API for creating client applications contains the audio fingerprinting functions. The github
repository can be found here:</p>
<p><a href="https://github.com/starkdg/JPhashAudio">JPHashAudio</a></p>
<p>Here’s the github repository for the C library and Java bindings for reading audio data:</p>
<p><a href="https://github.com/starkdg/libAudioData">JAudioData</a></p>
<p>Comments and suggestions are welcome on the issues page:</p>
<p><a href="https://github.com/starkdg/JAudioScout/issues/1">Comments and Suggestions</a></p>starkdgAudio track recognition is about identifying short audio clips among a larger collection of indexed tracks. To get a better idea for what it is:On Extracting Concise Image Descriptors from Natural Images2019-04-07T00:00:00+00:002019-04-07T00:00:00+00:00https://starkdg.github.io/posts/concise-image-descriptor<p>Image descriptors are essential for organizing a collection of images for the purpose
of search and retrieval. These can be as simple as unique file names, where an exact
file name is needed to retrieve each and every source image. In order to preserve a
sense of distance between images, however, where smaller distance indicate similarity and larger
distances wholy different images, a more elaborate scheme is necessary. The descriptors
need to preserve some of the essential features of the source image.<br />
<!--more--></p>
<p>As is well known by now, convolutional neural networks can be trained to give
good performance on image classification tasks. Several models perform at or above 95% classification
accuracy (<a href="https://github.com/tensorflow/models/tree/master/research/slim#Pretrained">tf-slim</a>). These neural networks are many layers deep. They take an image
as input and output an indication as to which class among many is most probable.
Interestingly, new classification layers can be fine-tuned on top of the model’s hidden layers
and achieve comparable performance on new classification tasks. This suggests that the
network’s hidden layers have learned fundamental features of the images. Furthermore,
the outputs from these hidden layers can serve as quality descriptors, which preserve smoothness
with subtle changes in input. However, they still exist in a relatively high-dimensional feature
space, which makes fast indexing difficult.</p>
<p>This post explores the cross-correlations in these hidden layer values - also called feature vectors -
in the hopes of reducing their size. It is our expectation that concise descriptors can be extracted
from these feature vectors with the minimal additional overhead of training a few extra layers that
can piggy-back on top of these classification models.</p>
<p>Here’s the general idea:</p>
<p><img src="/resources/post_1/figure1.png" alt="figure_1" /></p>
<h1 id="objectives-for-this-post">Objectives For This Post:</h1>
<ul>
<li>
<p>The development of a concise image descriptor suitable for long-term storage. It should be
robust enough for fast comparison using a distance metric - like Euclidean distance. This metric
should preserve small distances for perceptually similar images, while keeping wholly unique images
far enough apart.</p>
</li>
<li>
<p>Provide some fresh insight into the topology of these models’ feature vectors. Together
with the right kind of indexing structure, we should be able to retrieve all nearest neighbors of a given
image and get a good visual indication for what similar means in terms of the model’s feature vector.
We will use the MobilenetV2 neural net for all our work, but all the code is easy to modify to explore other models.</p>
</li>
<li>
<p>Develop a testing framework to evaluate these descriptors for this application. This
framework can be used to test the descriptors as well as the feature vector itself. In this way, we will be
able to see just where the descriptors lose the ability to discriminate between images as well as compare
the effectiveness of the new descriptors.</p>
</li>
</ul>
<p>All code can be found in the github repository here: <a href="https://github.com/starkdg/pyConvnetPhash"><strong>View On GitHub</strong></a></p>
<h2 id="a-test-framework">A Test Framework</h2>
<p>We need a framework to test all our models. First, we start with a corpus of natural images.
This is separate from the training set to be used to train the additional layers. It is strictly for evaluating
the efficiency of our descriptors in representing images. For now, no specific type of image is selected -
just a random collection of natural images.</p>
<p>Next, we distort each image in multiple ways: Gaussian blur, additive noise, crop, occlusion with a text overlay,
compression, rotation, horizontal and vertical flip, shear affine transformation, resize, histogram equalization, etc.</p>
<p>Here’s the script for how to prepare this test set of data: <a href="https://github.com/starkdg/pyConvnetPhash/blob/master/preprocess_image_files.py">preprocess_image_files.py</a></p>
<p>To establish what a normal distance is between arbitrary wholly dissimilar images, we draw random pairs
from the original image set, extract their descriptors, and calculate the distances of all the pairs. A histogram
is then formed from which to draw statistics, like mean, $\mu$ and standard deviation, $\sigma$.
We set two thresholds:</p>
<p>$T_1 = \mu - 2\sigma$ <br />
$T_2 = \mu - \sigma$</p>
<p>These thresholds will be used to evaluate our similar distance comparisons.</p>
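<p>Computationally, this amounts to a few lines of numpy. A sketch, with descriptor extraction abstracted away into a precomputed array and the function name being my own:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def dissimilar_thresholds(descriptors, n_pairs=10000, seed=0):
    """Estimate T1 and T2 from distances between random image pairs.

    descriptors: array of shape (n_images, d), one descriptor per image.
    """
    rng = np.random.default_rng(seed)
    n = len(descriptors)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                          # drop self-pairs
    d = np.linalg.norm(descriptors[i[keep]] - descriptors[j[keep]], axis=1)
    mu, sigma = d.mean(), d.std()
    return mu - 2 * sigma, mu - sigma      # T1, T2
</code></pre></div></div>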
<p>For each distortion class, distances are measured between original images and their distorted counterparts.
A histogram for each original/distorted class comparison is then formed and displayed next to the above histogram
of arbitrary image pairs. Ideally, the peaks of the two histograms should stand far apart and give a good visual
separation. The histogram of similar distances should sit markedly below the histogram of dissimilar distances.
This should give us a good visual clue as to how well the descriptor performs for our application.</p>
<p>Here’s an example of a histogram from one such test run:</p>
<p><img src="/resources/post_1/histogram-of-distances.png" alt="distance_histogram" /></p>
<p>As a quick way to compare different models without getting confused by a multitude of histograms, we will take
the percentage of distances that fall above each threshold. The closer to zero, the better.
This can be readily put in tabular form for easy perusal.</p>
<h2 id="analysis">Analysis</h2>
<p>To get an idea of how the model’s feature vector breaks down, we do a principal component analysis of the model’s
feature space. This can be done by calculating the covariance matrix for the feature vectors computed from a set
of images - resulting in a [1792x1792] matrix. The covariance matrix is a way of measuring the cross-correlations
between the components of the feature vector. An SVD decomposition of the covariance matrix will give us the leading
eigenvectors and their corresponding eigenvalues of the feature space - in descending order.</p>
<p>cov = U * $\Sigma$ * $V^T$</p>
<p>The eigenvalues can be found in diag($\Sigma$).
Plotting the cumulative sum of these eigenvalues:</p>
<p><img src="/resources/post_1/mobilenetv2-pca-by-svd-1792-256-singular_values.png" alt="singular_values" /></p>
<p>As you can see, the curve tops out at around 500 of the most significant eigenvectors. That is, most of the
information in the feature vector would fit in a 500-dimensional vector, making for a compression rate of
500/1792 = 0.28. What is more, the slope slows significantly after 250. In other words,
that is the point of diminishing returns on the new information that comes with each additional eigenvector.</p>
<p>It is important to keep in mind that memory limitations prohibit this experiment from being carried out with more
than a random selection of 500 images. However, we get similar results for different random selections.</p>
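<p>In condensed form, the construction looks roughly like the following numpy sketch (variable and function names are mine; the notebook linked below does the real work on .tfrecord inputs):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def pca_by_svd(features, n_components=256):
    """Learn a PCA projection from MobilenetV2 feature vectors.

    features: array of shape (n_images, 1792).
    Returns the mean, the projection matrix, and the cumulative eigenvalue sum.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)      # (1792, 1792) covariance matrix
    U, S, Vt = np.linalg.svd(cov)             # eigenvectors are the columns of U
    cum = np.cumsum(S) / np.sum(S)            # the curve plotted above
    W = U[:, :n_components]                   # keep the leading eigenvectors
    return mean, W, cum

# A descriptor is just the projection of a centered feature vector:
# descriptor = (f - mean) @ W
</code></pre></div></div>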
<p>Here’s the script for how I build the PCA transform model using SVD decomposition: <a href="https://github.com/starkdg/pyConvnetPhash/blob/master/train_pca_with_svd.ipynb">train_pca_with_svd.ipynb</a>
It relies on test images in TensorFlow’s .tfrecord format in your Google Drive. There’s another script in the repository
for putting images in .tfrecord format.</p>
<p>We can use the eigenvectors from the SVD decomposition - the columns of U - to transform the feature vector into its
principal components. We run our comparison tests on the resulting transforms for the first N coordinates -
for N = 512, 256, 128, 64 and 32. So, we get a sense of how many principal components are needed for each discrimination
task. Here are the results:</p>
<p><img src="/resources/post_1/mobilenetv2-pcabysvd-test_results-table1.png" alt="table1" /></p>
<p>The raw feature vector, raw-1792, and the full pca transform, pca-1792, do indeed appear to be pretty good descriptors for the
image content, at least according to this test. As expected, this ability is preserved when only the 512 leading eigen vectors
are kept, the only category that fails being the vertical flip - although it is somewhat curious that the horizontal flip distortion
still holds up. By pca-256, it fails in noise, shear, occlusion with text overlay, and, of course, vertical flip, which shows there’s
some crucial information in those dropped eigenvectors for those categories. Notice how the metric steadily degrades going down
the columns.</p>
<h2 id="analasis-update">Analasis <strong>UPDATE</strong></h2>
<p>It turns out that using more images to compute the covariance matrix - upon which the SVD decomposition is based - does
indeed change the above cumulative-sum plot of eigenvalues. Here is the plot for 4000 images:</p>
<p><img src="/resources/post_1/mobilenetv2-pca-by-svd-1792-256-singular_values2.png" alt="singular_values2" /></p>
<p>As you can see, while the overall shape of the plot remains the same - that is, the point of diminishing
returns is still reached at around 250 leading eigenvectors - there is more variance in the smaller eigenvectors.
This would explain why the ability to discriminate with respect to distortions such as vertical flipping is lost
with as few as 500 leading eigenvectors.</p>
<p>Fortunately, it doesn’t appear to have changed the results of our test:</p>
<p><img src="/resources/post_1/mobilenetv2-pcabysvd-test_results-table1a.png" alt="table1a" /></p>
<p>As a matter of fact, if anything, there appear to be some modest improvements.</p>
<h2 id="training-linear-pca-models-on-a-large-set-of-images">Training Linear PCA models On a Large Set of Images</h2>
<p>Next, we train on a larger set of images. We use images from the <a href="http://press.liacs.nl/mirflickr/mirdownload.html">mirflickr25k</a> data set of 25,000 images.
There’s a script in the repository to put them in TensorFlow’s .tfrecord format. For training, we break the image
set down as follows: 20,000 for a training set, 2,000 for a validation set, and 3,000 for a test set.</p>
<p>The model used is an autoencoder trained to learn a mapping from the 1792-dimensional feature vector to a 256-dimensional
descriptor. The new layers of the autoencoder model can be summarized like so:</p>
<p>$h = \sigma(W f + b_1)$ <br />
$y = W^T h + b_2$</p>
<p>for weights $W$ and bias vectors $b_1$ and $b_2$. Here, $f$ is the feature vector, $h$ the hidden layer, and $y$ the reconstruction of $f$.
$\sigma$(.) is the transfer function. We try three different variants of a transfer function: linear identity, non-linear
sigmoid activation, and a non-linear relu activation.</p>
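<p>A bare-bones numpy rendering of that forward pass and its reconstruction loss - just to make the tied-weight structure concrete; the actual models are trained in TensorFlow (see the repository):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def forward(f, W, b1, b2, transfer=lambda x: x):
    """Tied-weight autoencoder pass: f (1792,) -> h (256,) -> y (1792,).

    W has shape (256, 1792); the decoder reuses its transpose.
    """
    h = transfer(W @ f + b1)   # encoder output: the 256-d descriptor
    y = W.T @ h + b2           # reconstruction of the feature vector
    return h, y

def reconstruction_loss(f, y):
    return np.mean((f - y) ** 2)

# The three transfer-function variants tried below:
identity = lambda x: x
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
relu = lambda x: np.maximum(x, 0.0)
</code></pre></div></div>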
<p><img src="/resources/post_1/mobilenetv2-pca-1792to256-models-test_results-table2.png" alt="table2" /></p>
<p>The top row is merely a repeat of the above results from the SVD decomposition. The important takeaway here is that
all three models, trained on the full set of 20k images, get some improvement over the first row on this test. Even the
strictly linear pca-256-linear autoencoder model appears to gain some advantage. The non-linearities introduced by the
sigmoid and relu activation functions offer additional benefit.</p>
<h2 id="libphash-results">libpHash Results</h2>
<p>To give us some indication for how these results stack up, we put our earlier developed perceptual hashes to the test.</p>
<table>
<thead>
<tr>
<th>pHash</th>
<th>description</th>
<th>length</th>
<th>distance metric</th>
</tr>
</thead>
<tbody>
<tr>
<td>dctimage</td>
<td>DCT transform</td>
<td>64-bit</td>
<td>hamming distance</td>
</tr>
<tr>
<td>radial</td>
<td>Radon transform</td>
<td>180-byte</td>
<td>peak cross correlation</td>
</tr>
<tr>
<td>bmb</td>
<td>block mean bit</td>
<td>256-bit</td>
<td>hamming distance</td>
</tr>
<tr>
<td>mh</td>
<td>mexican hat wavelet</td>
<td>1024-bit</td>
<td>hamming distance</td>
</tr>
</tbody>
</table>
<p>An important distinction is that these perceptual hashes are much more concise than the above derived descriptors. Properly
quantized, the pca-256 would be 256 bytes long, which would be 2k bits. Also, they are completely deterministic and not
derived from any machine learning techniques whatsoever.</p>
<h3 id="results">Results:</h3>
<p><img src="/resources/post_1/mobilenetv2-libpHash-phashes-test_results-table3.png" alt="table3" /></p>
<p>The new descriptors are quite an improvement compared to the old hashes. Though the old hashes are sufficient in certain
areas - like blur, noise, occlusion with text overlay, compression, decimation, upscale and downscale - they show serious
deficiencies in the others. It is noteworthy, however, that the MH image hash remains far superior in the bright, dark and histeq categories.</p>
<h1 id="ideas-for-other-post">Ideas For Other Post</h1>
<ul>
<li>
<p>Explore other models.</p>
</li>
<li>
<p>multiple layer auto-encoder. Try adding layers so as to diminish the features gradually.
Like three layers 1792->1024->512->256 with a sigmoid activation in each layer.
However, my efforts so far have met with limited success in this area - at least, in terms
of this testing procedure.</p>
</li>
<li>
<p>Contractive Auto-encoder. Add a Jacobian norm as a regularization term to the learning process.
The Jacobian matrix is the matrix of first derivatives of the hidden layer with respect to the<br />
inputs. Its norm is a measure of the rate of change of the hidden layer. A regularization term<br />
should theoretically smooth the area around the immediate vicinity of each training sample.</p>
</li>
</ul>
<p>I’ll leave it at that for this post. Questions or comments can be sent to me at <a href="mailto:starkd88@gmail.com"><em>Contact</em></a></p>
<p>Or simply reply in the issues on github: <a href="https://github.com/starkdg/pyConvnetPhash/issues"><em>Comments and Suggestions</em></a></p>