This talk describes new approximate nearest-neighbor methods employed in a scalable audio-feature database system called "AudioDB." This open-source system is designed to store and search hundreds of millions of feature vectors on standard UNIX workstation platforms. A radius-bounded nearest-neighbor vector-sequence search algorithm, based on
locality-sensitive hashing (LSH), achieves sublinear retrieval times at this scale. The performance of the LSH-based algorithm depends critically on the choice of radius bound: a poorly chosen value degrades either retrieval accuracy or retrieval time. An optimal radius estimator is derived by modeling the minimum-value distribution of a random sample of a data set's pairwise distance distribution. When used with LSH, this yields accurate search results with retrieval times several orders of magnitude faster than exhaustive search methods and space-partitioning methods. The same
statistical sampling method is used to perform retrieval tasks at successively higher levels of specificity on labeled or unlabeled audio collections. The result is a system that (a) unifies audio retrieval tasks across a range of specificities, using the statistical framework of background distance-distribution sampling and hypothesis testing, (b) is as accurate as exhaustive search methods, and (c) is three orders of magnitude faster than exhaustive search methods.
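
As a rough illustration of the radius-estimation idea, the Python sketch below samples background pairwise distances and picks a radius from the distribution of their minimum. It is not AudioDB's estimator. The function names, the Euclidean metric, the independence assumption, the use of an empirical quantile, and the false_alarm parameter are all assumptions made for this example.

```python
import math
import random


def sample_pairwise_distances(vectors, num_pairs, rng):
    """Draw Euclidean distances between randomly chosen distinct pairs.

    Hypothetical helper: the metric and sampling scheme are assumptions.
    """
    distances = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(vectors)), 2)
        distances.append(math.dist(vectors[i], vectors[j]))
    return distances


def estimate_radius(background, database_size, false_alarm=0.05):
    """Choose a radius from the minimum-value distribution of background distances.

    Assuming n = database_size independent background distances with CDF F,
    the minimum satisfies P(min <= r) = 1 - (1 - F(r))**n. Setting this equal
    to false_alarm gives F(r) = 1 - (1 - false_alarm)**(1 / n), and r is read
    off as the empirical quantile of the sample at that level.
    """
    level = 1.0 - (1.0 - false_alarm) ** (1.0 / database_size)
    ordered = sorted(background)
    # Clamp so that very small levels fall back to the smallest sampled distance.
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 8-dimensional vectors stand in for audio features.
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
    background = sample_pairwise_distances(vectors, 2000, rng)
    print(estimate_radius(background, database_size=len(vectors)))
```

For large collections the target quantile sits far in the lower tail, where an empirical quantile of a small sample is coarse, so a practical estimator would more plausibly fit a parametric model to the sampled distances.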