The O(n²) time complexity for full image DB matching is discussed on the main SnapMatcher page. A few different algorithms have been investigated to improve search performance; however, these methods all have significant drawbacks, which are discussed below.
Spatial Partitioning with Multi-Dimensional Hash Table
To improve performance, a spatial partitioning method has been tried that can significantly reduce the number of pairings each image is tested against. Basically, each image is placed into a multidimensional hash table (one dimension for each value in the signature vector), and images are only tested against nearby images in this multidimensional space. The drawback to this approach is that spatial proximity is a somewhat poor predictor of correlation coefficients, so over-aggressive partitioning will generally miss some image matches. Because of this, partitioning is currently disabled, but it can be re-enabled at different levels by modifying the HASH_STEPS value in the code.
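As a rough illustration of the idea (this is not the actual SnapMatcher code; the signature length, value range, and HASH_STEPS value below are just assumptions), here is how signatures could be bucketed and how candidate images might be pulled from the same or adjacent buckets:

```python
# Minimal sketch of signature bucketing; names and parameters are assumptions.
from collections import defaultdict
from itertools import product

HASH_STEPS = 4  # bins per signature dimension (hypothetical value)

def bucket_key(signature, lo=-1.0, hi=1.0, steps=HASH_STEPS):
    """Quantize each signature value into one of `steps` bins."""
    width = (hi - lo) / steps
    return tuple(min(int((v - lo) / width), steps - 1) for v in signature)

def build_hash(signatures):
    """Map bucket key -> list of image ids sharing that bucket."""
    table = defaultdict(list)
    for image_id, sig in signatures.items():
        table[bucket_key(sig)].append(image_id)
    return table

def candidates(table, signature):
    """Collect images in the same bucket or any adjacent bucket."""
    key = bucket_key(signature)
    found = []
    # Visit every neighboring bucket (offset -1, 0, +1 in each dimension).
    for offsets in product((-1, 0, 1), repeat=len(key)):
        neighbor = tuple(k + o for k, o in zip(key, offsets))
        found.extend(table.get(neighbor, []))
    return found
```

Note that visiting every adjacent bucket costs 3^d lookups for a d-value signature, so for longer signatures a coarser neighborhood (or same-bucket-only lookups) would be needed in practice.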
To improve the speedup achieved by partitioning without losing image matches, a few different methods have been tried for generating the image signature, including using pixel brightness directly and the current method, which stores the differences between adjacent pixels (a rough sketch of both appears after the table below). While both methods produce reliable signature correlation coefficients between similar images, the difference method seems slightly better at choosing good match candidates. Here is some experimental data (note that the brightness method included some false positives that inflated the number of matches found, which is why the difference method achieved a perfect success rate with fewer matches). Unfortunately, the image collection used to generate this data was discarded in a system upgrade:
| Method | # Splits | Comparisons Made | Pairs Found | Success Rate | Speedup |
|---|---|---|---|---|---|
| Difference | 1 | 4656 | 101 | 1.000 | 1.0 |
| Difference | 2 | 4656 | 101 | 1.000 | 1.0 |
| Difference | 4 | 4493 | 101 | 1.000 | 1.0 |
| Difference | 8 | 3320 | 100 | 0.990 | 1.4 |
| Difference | 16 | 1181 | 91 | 0.901 | 3.9 |
| Difference | 32 | 241 | 76 | 0.752 | 19.3 |
| Difference | 64 | 80 | 60 | 0.594 | 58.2 |
| Difference | 128 | 47 | 47 | 0.465 | 99.1 |
| Difference | 256 | 40 | 40 | 0.396 | 116.4 |
| Difference | 512 | 32 | 32 | 0.317 | 145.5 |
| Difference | 1024 | 23 | 23 | 0.228 | 202.4 |
| Brightness | 1 | 4656 | 109 | 1.000 | 1.0 |
| Brightness | 2 | 4656 | 109 | 1.000 | 1.0 |
| Brightness | 4 | 4656 | 109 | 1.000 | 1.0 |
| Brightness | 8 | 2619 | 99 | 0.908 | 1.8 |
| Brightness | 16 | 770 | 80 | 0.734 | 6.0 |
| Brightness | 32 | 154 | 76 | 0.697 | 30.2 |
| Brightness | 64 | 54 | 50 | 0.459 | 86.2 |
| Brightness | 128 | 44 | 44 | 0.404 | 105.8 |
| Brightness | 256 | 33 | 33 | 0.303 | 141.1 |
| Brightness | 512 | 27 | 27 | 0.248 | 172.4 |
| Brightness | 1024 | 18 | 18 | 0.165 | 258.7 |
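For reference, here is a rough sketch of the two signature methods compared above. This is not SnapMatcher's actual signature code; the grid size and the horizontal-difference choice are assumptions made for illustration:

```python
# Illustrative sketch of the two signature methods; grid size and grayscale
# input are assumptions, not SnapMatcher's actual parameters.
import numpy as np

def brightness_signature(pixels, grid=8):
    """Signature = mean brightness of each cell in a grid x grid layout.
    Assumes `pixels` is a 2-D grayscale array."""
    img = np.asarray(pixels, dtype=float)
    h, w = img.shape
    cells = img[:h - h % grid, :w - w % grid].reshape(
        grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    return cells.ravel()

def difference_signature(pixels, grid=8):
    """Signature = differences between horizontally adjacent grid cells."""
    cells = brightness_signature(pixels, grid).reshape(grid, grid)
    return np.diff(cells, axis=1).ravel()

def correlation(sig_a, sig_b):
    """Pearson correlation coefficient between two signatures."""
    return float(np.corrcoef(sig_a, sig_b)[0, 1])
```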
Unfortunately, as you can see, the success rate (the percentage of actual matches found after partitioning) starts to decline just as speedup starts to appear. A 19 times speedup with a 75% success rate (difference method, 32 splits) isn't too shabby, though, if speed is more important than finding every matching pair. I think a better signature method could improve this, but what I'd really like is a way to organize the DB so that correlating signatures cluster together, rather than using the current partitioning method, which only poorly approximates correlation. I'm thinking that something like PCA (principal component analysis) might be a useful tool here, but I haven't investigated it in depth. Any statisticians or search experts reading this should feel free to contact me with suggestions.
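To make the PCA idea a little more concrete, the projection step might look something like the following. This is purely speculative and not something SnapMatcher currently does:

```python
# Speculative sketch: project signatures onto their top principal components
# before partitioning, so most of the variance falls along a few axes.
import numpy as np

def pca_project(signatures, n_components=2):
    """Project rows of `signatures` onto their top principal components."""
    X = np.asarray(signatures, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD gives the principal axes in the rows of Vt.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T
```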
Cluster Methods
After deciding that spatial partitioning would likely provide at best marginal speedups while avoiding a large number of missed matches, I looked into true clustering methods, which seemed better suited to this type of problem. It turns out that while clustering could provide a substantial speedup in the search phase, reducing a single image search from O(n) to O(log₂ n) for an optimally built hierarchical cluster, the process of building the cluster in the first place likely requires O(n³) time, significantly more than the O(n²) required for a one-time, brute-force, all-pairs matching.
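For what it's worth, the query side of a hierarchical cluster is simple enough. The sketch below uses a hypothetical node structure (not implemented in SnapMatcher) to show the O(log₂ n) descent that makes the approach attractive despite the build cost:

```python
# Hypothetical hierarchical cluster query; not part of SnapMatcher.
import numpy as np

class ClusterNode:
    def __init__(self, centroid, children=None, image_ids=None):
        self.centroid = np.asarray(centroid, dtype=float)
        self.children = children or []     # empty for leaf nodes
        self.image_ids = image_ids or []   # populated only at leaves

def query(node, signature):
    """Descend toward the child whose centroid is closest to the query."""
    sig = np.asarray(signature, dtype=float)
    while node.children:
        node = min(node.children,
                   key=lambda c: np.linalg.norm(c.centroid - sig))
    return node.image_ids  # candidate matches, to verify by full correlation
```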
Most of the research I've skimmed concerning image matching in large databases involves large numbers of users querying large public image repositories. In that sort of setup the clustering process can be performed offline by powerful computing clusters using parallel clustering algorithms, so that live queries can be made quickly. For a home user this approach does not make much sense, because the burden of creating the image cluster cannot be hidden from the end user making the queries. Perhaps a different clustering method could be used (possibly trading optimal hierarchical structure for slightly sub-optimal query times) that would strike a more appropriate balance, but unless cluster creation time can be dropped to something like O(n log₂ n), while improving search time without a high risk of missed image matches, cluster-based speedup methods will be put on hold indefinitely.
Conclusion
As it stands now, a huge repository of 1 million images can be compared against a single query image in only a few seconds, so I believe the most pragmatic approach for SnapMatcher is to improve image DB maintenance so that redundant queries are eliminated and DB updates are made incrementally. By comparing, say, 100 new images only to each other and to the images previously analyzed in a collection already containing thousands of images, full updates can be performed in seconds. The only painful part of the process will be the creation of a completely new image DB for a large image collection, and based on existing algorithms it looks like this phase cannot be improved beyond O(n²) without missing a significant percentage of matches.
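Some back-of-the-envelope numbers illustrate why incremental updates are attractive; the collection sizes here are just examples, not measurements:

```python
# Rough comparison counts for incremental vs. full DB rebuilds.
def comparisons_incremental(new, existing):
    """Pairings needed to merge `new` images into an analyzed collection."""
    return new * (new - 1) // 2 + new * existing

def comparisons_full(total):
    """Pairings needed to rebuild the whole DB from scratch."""
    return total * (total - 1) // 2

print(comparisons_incremental(100, 10_000))  # 1,004,950 pairings
print(comparisons_full(10_100))              # 50,999,950 pairings
```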
Contact me at arkaein@monsterden.net with any questions, suggestions, bug reports or patches for SnapMatcher.