SnapMatcher - Monstrous Software

Spatial Partitioning Clustering Conclusion

The O(n²) time complexity for full image DB matching is discussed on the main SnapMatcher page. A few different algorithms have been looked at to improve search performance, however these methods all have significant drawbacks which are discussed below.

Spatial Partitioning with Multi-Dimensional Hash Table

To improve performance a spatial partitioning method has been experimented with which can significantly reduce the number of pairings each image is tested in. Basically, each image is placed into a multidimensional hash table (one dimension for each value in the signature vector), and images are only tested against nearby images in this multidimensional space. The drawback to this approach is that spatial partitioning is a somewhat poor predictor of correlation coefficients, so over-aggressive partitioning will generally miss some image matches. Because of this partitioning is currently disabled, but can be restored to different levels by modifying the HASH_STEPS value in the code.

To improve the speedup achieved using partitioning without losing image matches a few different methods have been tried to here to generate the image signature, including using pixel brightness directly and the current method, storing the differences between adjacent pixels. While both methods provide reliable signature correlation coefficients between similar images, the difference method seems to be slightly more efficient to choosing good match candidates. Here is some experimental data (note that the brightness method included some false positives that pushed up the number of matches found, which is why difference achieve a perfect success rate with fewer matches). Unfortunately, the image collection I used to generate this data was discarded in a system upgrade:

Method	# Splits	Comparisons Made	Pairs Found	Success	Speedup
Difference	1	4656	101	1.000	1.0
	2	4656	101	1.000	1.0
	4	4493	101	1.000	1.0
	8	3320	100	0.990	1.4
	16	1181	91	0.901	3.9
	32	241	76	0.752	19.3
	64	80	60	0.594	58.2
	128	47	47	0.465	99.1
	256	40	40	0.396	116.4
	512	32	32	0.317	145.5
	1024	23	23	0.228	202.4
Brightness	1	4656	109	1.000	1.0
	2	4656	109	1.000	1.0
	4	4656	109	1.000	1.0
	8	2619	99	0.908	1.8
	16	770	80	0.734	6.0
	32	154	76	0.697	30.2
	64	54	50	0.459	86.2
	128	44	44	0.404	105.8
	256	33	33	0.303	141.1
	512	27	27	0.248	172.4
	1024	18	18	0.165	258.7

Unfortunately, as you can see the success rate (percentage of actual matches found after partitioning) starts to decline just when speedup starts to appear. A 19 times speedup with 75% success rate (difference method, 32 splits) isn't too shabby though, if speed is more important than finding every matching pair. I think that a better signature method could improve this, but what I'd really like is a way to organize the DB so that correlating signatures cluster together, rather than using the current partitioning method that poorly approximates correlation. I'm thinking that something like PCA (principle component analysis) might be a tool to use here, but I haven't investigated this in depth. Any statisticians or search experts reading this feel free to contact me with any suggestions.

Cluster Methods

After deciding that spatial partitioning would likely provide at best marginal speedups while avoiding a large number of missed matches, I looked into true clustering methods which seemed better suited for this type of problem. It turns out that while clustering could provide substantial speedup in the search phase, reducing single image search from O(n) to O(log₂n) for an optimally built hierarchical cluster, the process of building the cluster in the first place likely requires O(n³) time, significantly larger than the O(n²) required for a one time, brute force, all-pairs matching.

Most of the research I've skimmed concerning image matching in large databases involve large numbers of users querying a large public image repositories. In this sort of setup the clustering process can be performed offline by power computing clusters using parallel clustering algorithms, so that live queries can be made quickly. For a home user this approach does not make much sense because the burden of the image cluster creation cannot be hidden away from the end user making the queries. Perhaps a different clustering method could be used (possibly trading optimal hierarchical structure for slightly sub-optimal query times) that would strike a more appropriate balance, but unless the cluster creation time can be dropped to at least O(log₂n) time, while improving search time without high risk of missed image matches, cluster based speedup methods will be put on hold indefinitely.

Conclusion

As it stands now huge repository of 1 million images can be compared against a single image query in only a few seconds, so I believe the most pragmatic approach to take with SnapMatcher is to improve image DB maintenance so that redundant queries are eliminated and DB updates are made incrementally. By comparing, say, 100 new images only to each other and to all images previously analyzed in a collection already containing thousands of images, full updates can be performed in only seconds. The only painful part of the process will be the creation of a completely new image DB for a significant image collection, and based on existing algorithms it looks like this phase likely cannot be improved to better than O(n²) without a significant percentage of missed matches.

Contact me at arkaein@monsterden.net with any questions, suggestions, bug reports or patches for SnapMatcher.

Back to SnapMatcher

Back to Monstrous Software

The SnapMatcher Speedup Problem

Spatial Partitioning with Multi-Dimensional Hash Table

Cluster Methods

Conclusion