Non-Metric Space Library (NMSLIB) Manual
Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, and David Novak
Maintainer: Leonid Boytsov leo@boytsov.info
Version 1.8
Thursday 6th June, 2019
Abstract. This document covers a library for fast similarity (k-NN)
search. It describes only search methods and distances (spaces). Details
about building, installing, Python bindings can be found online:
https://github.com/searchivarius/nmslib/tree/v1.8/. Even though
the library contains a variety of exact metric-space access methods, our
main focus is on more generic and approximate search methods, in par-
ticular, on methods for non-metric spaces. NMSLIB is possibly the first
library with a principled support for non-metric space searching.
1 Objectives and History
Non-Metric Space Library (NMSLIB) is an efficient and extendable cross-
platform similarity search library and a toolkit for evaluation of similarity search
methods. The core library does not have any third-party dependencies.
The goal of the project is to create an effective and comprehensive toolkit
for searching in generic metric and non-metric spaces. Even though the
library contains a variety of metric-space access methods, our main focus is on
generic and approximate search methods, in particular, on methods for non-
metric spaces. NMSLIB is possibly the first library with a principled support for
non-metric space searching.
NMSLIB is an extendible library, which means that it is possible to add new
search methods and distance functions. NMSLIB can be used directly in C++
and Python (via Python bindings). In addition, it is also possible to build a
query server, which can be used from Java (or other languages supported by
Apache Thrift). Java has a native client, i.e., it works on many platforms without
requiring a C++ library to be installed.
NMSLIB started as a personal project of Bilegsaikhan Naidan, who created
the initial code base, the Python bindings, and participated in earlier evalua-
tions. The most successful class of methods, neighborhood/proximity graphs, is
represented by the Hierarchical Navigable Small World Graph (HNSW) due to
Malkov and Yashunin [33]. Other particularly useful methods include a modification
of the Vantage-Point Tree (VP-tree) [6], a Neighborhood APProximation index
(NAPP) [43], which was improved by David Novak, as well as a vanilla uncom-
pressed inverted file.
The current version of the manual focuses solely on the description of:
- search spaces and distances (see §3);
- search methods (see §4).
Details about building/installing, benchmarking, extending the code, as well
as basic tuning guidelines, can be found online:
https://github.com/nmslib/nmslib/tree/v1.8/manual/README.md.
2 Terminology and Problem Formulation
Similarity search is an essential part of many applications, which include, among
others, content-based retrieval of multimedia and statistical machine learning.
The search is carried out in a finite database of objects {oi}, using a search query
q and a dissimilarity measure (the term data point or simply a point is often
used as a synonym to denote either a data object or a query). The dissimilarity
measure is typically represented by a distance function d(oi, q). The ultimate
goal is to answer a query by retrieving a subset of database objects sufficiently
similar to the query q. These objects will be called answers. A combination of
data points and the distance function is called a search space, or simply a space.
Note that we use the terms distance and the distance function in a broader
sense than some of the textbooks: We do not assume that the distance is a
true metric distance. The distance function can disobey the triangle inequality
and/or be even non-symmetric.
Two retrieval tasks are typically considered: a nearest neighbor and a range
search. The nearest neighbor search aims to find the least dissimilar object, i.e.,
the object at the smallest distance from the query. Its direct generalization is
the k-nearest neighbor search (the k-NN search), which looks for the k closest
objects. Given a radius r, the range query retrieves all objects within a query
ball (centered at the query object q) with the radius r, or, formally, all the ob-
jects {oi} such that d(oi, q) ≤ r. In generic spaces, the distance is not necessarily
symmetric. Thus, two types of queries can be considered. In a left query, the ob-
ject is the left argument of the distance function, while the query is the right
argument. In a right query, q is the first argument and the object is the second,
i.e., the right, argument.
The queries can be answered either exactly, i.e., by returning a complete
result set that does not contain erroneous elements, or, approximately, e.g.,
by finding only some answers. Thus, the methods are evaluated in terms of
efficiency-effectiveness trade-offs rather than merely in terms of their efficiency.
One common effectiveness metric is recall. In the case of the nearest neighbor
search, it is computed as the fraction of true nearest neighbors returned by the
method, averaged over queries (with ties broken arbitrarily).
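
For illustration, the following small Python sketch (not part of NMSLIB; the helper name knn_recall is hypothetical) shows how k-NN recall is typically computed:

def knn_recall(true_neighbors, found_neighbors):
    # true_neighbors and found_neighbors are lists (one entry per query)
    # of sets of object identifiers from exact and approximate search.
    per_query = [len(truth & found) / len(truth)
                 for truth, found in zip(true_neighbors, found_neighbors) if truth]
    return sum(per_query) / len(per_query)

# Two 3-NN queries; the second query misses one true neighbor, so recall is about 0.83.
print(knn_recall([{1, 2, 3}, {4, 5, 6}], [{1, 2, 3}, {4, 5, 9}]))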
3 Spaces
Currently we provide implementations mostly for vector spaces. Vector-space
input files can come in either a regular (i.e., dense) or a sparse variant. A detailed list
of spaces, their parameters, and performance characteristics is given in Table 1.
The mnemonic name of the space is passed to the Python bindings as well as
to the benchmarking utility; see:
https://github.com/nmslib/nmslib/tree/v1.8/manual/benchmarking.md.
There can be more than one version of a distance function; the versions differ
in their space-performance trade-offs. In particular, for distances that require computa-
tion of logarithms we can achieve an order of magnitude improvement (e.g., for
the GNU C++ compiler and Clang) by pre-computing logarithms at index time. This
comes at a price of extra storage. In the case of Jensen-Shannon distance func-
tions, we can pre-compute some of the logarithms and accurately approximate
those we cannot pre-compute. The details are explained in §3.2-3.6.
Straightforward slow implementations of the distance functions may have
the substring slow in their names, while faster versions contain the substring
fast. Fast functions that involve approximate computations contain additionally
the substring approx. For a non-symmetric distance function, a space may have
two variants: one variant is for left queries (the data object is the first, i.e., left,
argument of the distance function while the query object is the second argument)
and another is for right queries (the data object is the second argument and the
query object is the first argument). In the latter case the name of the space ends
in rq. Separating spaces by query types might not be the best approach. Yet,
it seems to be unavoidable because, in many cases, we need separate indices to
support left and right queries [10]. If you know a better approach, feel free to
tell us.
3.1 Details of Distance Efficiency Evaluation
Distance computation efficiency was evaluated on a Core i7 laptop (3.4 GHz peak
frequency) in a single-threaded mode by the utility bench_distfunc.
It is measured in millions of computations per second for single-
precision floating-point numbers (double-precision computations are, of course,
more costly). The code was compiled using the GNU compiler. All data sets were
small enough to fit in a CPU cache, which may have resulted in slightly more
optimistic performance numbers for cheap distances such as L2.
Somewhat higher efficiency numbers can be obtained by using the Intel com-
piler or Visual Studio (Clang seems to be about as efficient as the GNU com-
piler). In fact, performance is much better for distances relying on “heavy” math
functions: slow versions of the KL- and Jensen-Shannon divergences and the Jensen-
Shannon metric, as well as for Lp spaces where p ∉ {1, 2, ∞}.
In the efficiency test, all dense vectors have 128 elements. For all dense-vector
distances except the Jensen-Shannon divergence, their elements were generated
randomly and uniformly. For the Jensen-Shannon divergence, we first generate
elements randomly, and next we randomly select elements that are set to zero
(approximately half of all elements). Additionally, for KL-divergences and the
JS-divergence, we normalize vector elements so that they correspond to a true
discrete probability distribution.
Sparse space distances were tested using sparse vectors from two sample files
in the sample data directory. Sparse vectors in the first and the second file on
average contain about 100 and 600 non-zero elements, respectively.
String distances were tested using DNA sequences sampled from a human
genome.[1] The length of each string was sampled from a normal distribution
N(32, 4).
The Signature Quadratic Form Distance (SQFD) [3,2] was tested using signa-
tures extracted from LSVRC-2014 data set [38], which contains 1.2 million high
resolution images. We implemented our own code to extract signatures following
the method of Beecks [2]. For each image, we selected 104pixels randomly and
mapped them into 7-dimensional feature space: three color, two position, and
two texture dimensions. The features were clustered by the standard k-means al-
gorithm with 20 clusters. Then, each cluster was represented by an 8-dimensional
vector, which included a 7-dimensional centroid and a cluster weight (the number
of cluster points divided by 10^4).
3.2 Lp-norms and the Hamming Distance
The Lp distance between vectors x and y is given by the formula:

L_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (1)

In the limit (p → ∞), the Lp distance becomes the Maximum metric, also known
as the Chebyshev distance:

L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|    (2)
L∞ and all spaces Lp for p ≥ 1 are true metrics. They are symmetric, equal
to zero only for identical elements, and, most importantly, satisfy the triangle
inequality. However, the Lp norm is not a metric if p < 1.
In the case of dense vectors, we have reasonably efficient implementations
for Lp distances where p is either an integer or infinity. The most efficient imple-
mentations are for L1 (Manhattan), L2 (Euclidean), and L∞ (Chebyshev). As
explained in the author's blog, we compute exponents through square rooting.
This works best when the number of binary digits after the binary point is small,
e.g., if p = 0.125.
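
To illustrate, here is a small NumPy sketch (not NMSLIB's optimized C++ code) of Eq. (1) and of the square-rooting trick for a fractional p such as 0.125 = 1/8:

import numpy as np

def lp_distance(x, y, p):
    # Direct evaluation of Eq. (1).
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def lp_distance_eighth(x, y):
    # Specialization for p = 0.125: t ** 0.125 = sqrt(sqrt(sqrt(t))),
    # so three square roots replace an expensive pow() call; 1/p = 8.
    t = np.abs(x - y)
    return np.sum(np.sqrt(np.sqrt(np.sqrt(t)))) ** 8.0

x = np.random.rand(128).astype(np.float32)
y = np.random.rand(128).astype(np.float32)
print(lp_distance(x, y, 0.125), lp_distance_eighth(x, y))  # nearly identical values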
Any Lp space can have a dense and a sparse variant. Sparse vector spaces
have their own mnemonic names, which differ from dense-space mnemonic
names in that they contain the suffix sparse (see also Table 1). For instance, l1
and l1_sparse are both L1 spaces, but the former is dense and the latter is
sparse. The mnemonic names of the L1, L2, and L∞ spaces (passed to the bench-
marking utility) are l1, l2, and linf, respectively. Other generic Lp spaces have the
[1] http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/
Table 1: Description of implemented spaces. Efficiency is given in millions of operations per second, with one value per listed variant.

Metric spaces

  Hamming
    Mnemonic name(s): bit_hamming
    Formula: \sum_{i=1}^{n} |x_i - y_i|
    Efficiency: 50-400

  Jaccard
    Mnemonic name(s): bit_jaccard, jaccard_sparse
    Formula: \sum_{i=1}^{n} \min(x_i, y_i) / \sum_{i=1}^{n} \max(x_i, y_i)
    Efficiency: 500-200, 2

  L1
    Mnemonic name(s): l1, l1_sparse
    Formula: \sum_{i=1}^{n} |x_i - y_i|
    Efficiency: 35, 1.6

  L2
    Mnemonic name(s): l2, l2_sparse
    Formula: \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}
    Efficiency: 30, 1.6

  L∞
    Mnemonic name(s): linf, linf_sparse
    Formula: \max_{i=1}^{n} |x_i - y_i|
    Efficiency: 34, 1.6

  Lp (generic p ≥ 1)
    Mnemonic name(s): lp:p=..., lp_sparse:p=...
    Formula: \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
    Efficiency: 0.1-3, 0.1-1.2

  Angular distance
    Mnemonic name(s): angulardist, angulardist_sparse, angulardist_sparse_fast
    Formula: \arccos\left( \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}} \right)
    Efficiency: 13, 1.4, 3.5

  Jensen-Shannon metric
    Mnemonic name(s): jsmetrslow, jsmetrfast, jsmetrfastapprox
    Formula: \sqrt{\frac{1}{2} \sum_{i=1}^{n} \left[ x_i \log x_i + y_i \log y_i - (x_i + y_i) \log \frac{x_i + y_i}{2} \right]}
    Efficiency: 0.3, 1.9, 4.8

  Levenshtein
    Mnemonic name(s): leven (see §3.8 for details)
    Efficiency: 0.2

  SQFD
    Mnemonic name(s): sqfd_minus_func, sqfd_heuristic_func:alpha=..., sqfd_gaussian_func:alpha=... (see §3.9 for details)
    Efficiency: 0.05, 0.05, 0.03

Non-metric spaces (symmetric distance)

  Lp (generic p < 1)
    Mnemonic name(s): lp:p=..., lp_sparse:p=...
    Formula: \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
    Efficiency: 0.1-3, 0.1-1

  Jensen-Shannon divergence
    Mnemonic name(s): jsdivslow, jsdivfast, jsdivfastapprox
    Formula: \frac{1}{2} \sum_{i=1}^{n} \left[ x_i \log x_i + y_i \log y_i - (x_i + y_i) \log \frac{x_i + y_i}{2} \right]
    Efficiency: 0.3, 1.9, 4.8

  Cosine distance
    Mnemonic name(s): cosinesimil, cosinesimil_sparse, cosinesimil_sparse_fast
    Formula: 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}
    Efficiency: 13, 1.4, 3.5

  Normalized Levenshtein
    Mnemonic name(s): normleven (see §3.8 for details)
    Efficiency: 0.2

Non-metric spaces (non-symmetric distance)

  Regular KL-divergence
    Mnemonic name(s): left queries: kldivfast; right queries: kldivfastrq
    Formula: \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i}
    Efficiency: 0.5, 27

  Generalized KL-divergence
    Mnemonic name(s): left queries: kldivgenslow, kldivgenfast; right queries: kldivgenfastrq
    Formula: \sum_{i=1}^{n} \left[ x_i \log \frac{x_i}{y_i} - x_i + y_i \right]
    Efficiency: 0.5, 27 (left queries); 27 (right queries)

  Itakura-Saito
    Mnemonic name(s): left queries: itakurasaitoslow, itakurasaitofast; right queries: itakurasaitofastrq
    Formula: \sum_{i=1}^{n} \left[ \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right]
    Efficiency: 0.2, 3, 14 (left queries); 14 (right queries)

  Rényi divergence
    Mnemonic name(s): renyidiv_slow, renyidiv_fast
    Formula: \frac{1}{\alpha - 1} \log \left[ \sum_{i=1}^{m} x_i^{\alpha} y_i^{1-\alpha} \right]
    Efficiency: 0.4, 0.5-1.5
name lp, which is used in combination with a parameter. For instance, L3 is
denoted as lp:p=3.
Distance functions for sparse-vector spaces are far less efficient due to the
costly, branch-heavy operation of matching sparse vector indices (between two
sparse vectors).
In the special case of L1 for binary vectors, the L1 distance becomes the
Hamming distance. This case is represented by the bit_hamming space, where
data points are stored as compact bit vectors.
3.3 Scalar-product Related Distances
We have two distance functions whose formulas include the normalized scalar prod-
uct. One is the cosine distance, which is equal to:

d(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}
The cosine distance is not a true metric, but it can be converted into one by
applying a monotonic transformation (i.e., subtracting the cosine distance from
one and taking an inverse cosine). The resulting distance function is a true
metric, which is called the angular distance. The angular distance is computed
using the following formula:
d(x, y) = \arccos\left( \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}} \right)
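
As an illustration, a short NumPy sketch (not NMSLIB code) of both the cosine and the angular distance:

import numpy as np

def cosine_distance(x, y):
    # One minus the normalized scalar product.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def angular_distance(x, y):
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clipping guards against rounding pushing the ratio slightly outside [-1, 1].
    return np.arccos(np.clip(sim, -1.0, 1.0))

x, y = np.random.rand(128), np.random.rand(128)
print(cosine_distance(x, y), angular_distance(x, y))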
In the case of sparse spaces, to compute the scalar product, we need to ob-
tain an intersection of vector element ids corresponding to non-zero elements.
A classic text-book intersection algorithm (akin to a merge-sort) is not particu-
larly efficient, apparently, due to frequent branching. For single-precision floating
point vector elements, we provide a more efficient implementation that relies on
the all-against-all comparison SIMD instruction mm cmpistrm. This implemen-
tation (inspired by the set intersection algorithm of Schlegel et al. [41]) is about
2.5-3 times faster than a pure C++ implementation based on the merge-sort
approach.
3.4 Jaccard Distance
The Jaccard distance is a true metric. Given two binary vectors x and y, the
Jaccard distance (also called the Jaccard index) is computed using the following formula:

\frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)}
For the Jaccard distance, there are two ways to represent data:
- a sparse set of dimensions (space jaccard_sparse);
- a dense binary bit vector (space bit_jaccard).
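
For illustration, a tiny Python sketch (not NMSLIB's implementation) of the min/max ratio from the formula above applied to 0/1 vectors:

import numpy as np

def jaccard_ratio(x, y):
    # For binary vectors, min() gives the intersection and max() the union.
    union = np.sum(np.maximum(x, y))
    return np.sum(np.minimum(x, y)) / union if union > 0 else 0.0

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])
print(jaccard_ratio(x, y))  # 2 / 4 = 0.5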
3.5 Jensen-Shannon Divergence
Jensen-Shannon divergence is a symmetrized and smoothed KL-divergence:

\frac{1}{2} \sum_{i=1}^{n} \left[ x_i \log x_i + y_i \log y_i - (x_i + y_i) \log \frac{x_i + y_i}{2} \right]    (3)
This divergence is symmetric, but it is not a metric function. However, the square
root of the Jensen-Shannon divergence is a proper metric [19], which we call
the Jensen-Shannon metric.
A straightforward implementation of Eq. 3 is inefficient for two reasons (at
least when one uses the GNU C++ compiler): (1) computation of logarithms is a
slow operation; (2) the case of zero x_i and/or y_i requires conditional processing,
i.e., costly branches.
A better method is to pre-compute logarithms of data at index time. It is also
necessary to compute logarithms of a query vector. However, this operation has
little cost, since it is carried out once for each nearest-neighbor or range query.
Pre-computation leads to a 3-10 fold improvement depending on the sparsity of
vectors, albeit at the expense of requiring twice as much space. Unfortunately,
it is not possible to avoid computation of the third logarithm: it needs to be
computed at points that are not known until we see the query vector.
However, it is possible to approximate it with a very good precision, which
should be sufficient for the purpose of approximate searching. Let us rewrite
Equation 3 as follows:

\frac{1}{2} \sum_{i=1}^{n} \left[ x_i \log x_i + y_i \log y_i - (x_i + y_i) \log \frac{x_i + y_i}{2} \right] =

= \frac{1}{2} \sum_{i=1}^{n} \left[ x_i \log x_i + y_i \log y_i \right] - \sum_{i=1}^{n} \frac{x_i + y_i}{2} \log \frac{x_i + y_i}{2} =

= \frac{1}{2} \sum_{i=1}^{n} \left[ x_i \log x_i + y_i \log y_i \right] - \sum_{i=1}^{n} \frac{x_i + y_i}{2} \left[ \log \frac{1}{2} + \log \max(x_i, y_i) + \log \left( 1 + \frac{\min(x_i, y_i)}{\max(x_i, y_i)} \right) \right]    (4)
We can pre-compute all the logarithms in Eq. 4 except for log(1 + min(x_i, y_i)/max(x_i, y_i)).
However, the argument of this logarithm lies in a small range: from one to two. We can dis-
cretize the range, compute logarithms at many intermediate points, and save the
computed values in a table. Finally, we employ SIMD instructions to im-
plement this approach. This is a very efficient approach, which results in a very
small (around 10^{-6} on average) relative error for the value of the Jensen-Shannon
divergence.
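
The following NumPy sketch mimics the table-based discretization of the logarithm on [1, 2] (NMSLIB's actual implementation is SIMD-based and differs in detail):

import numpy as np

TABLE_SIZE = 65536  # a larger table gives a smaller approximation error
_LOG_TABLE = np.log(np.linspace(1.0, 2.0, TABLE_SIZE))

def approx_log_1_to_2(v):
    # v is expected to lie in [1, 2]; look up the nearest precomputed logarithm.
    idx = np.rint((v - 1.0) * (TABLE_SIZE - 1)).astype(np.int64)
    return _LOG_TABLE[idx]

v = np.random.uniform(1.0, 2.0, size=1000)
print(np.max(np.abs(approx_log_1_to_2(v) - np.log(v))))  # tiny absolute error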
Another possible approach is to use an efficient approximation for logarithm
computation. As our tests show, this method is about 1.5 times faster (1.5 vs
1.0 billion logarithms per second), but for logarithms in the range [1, 2],
the relative error is one order of magnitude higher (for a single logarithm) than for
the table-based discretization approach.
3.6 Bregman Divergences
Bregman divergences are typically non-metric distance functions, which are equal
to the difference between some convex differentiable function f and its first-order
Taylor expansion [8,10]. More formally, given a convex and differentiable func-
tion f (of many variables), its corresponding Bregman divergence d_f(x, y) is
equal to:

d_f(x, y) = f(x) - f(y) - \left( \nabla f(y) \cdot (x - y) \right)

where x \cdot y denotes the scalar product of vectors x and y. In this library, we im-
plement the generalized KL-divergence and the Itakura-Saito divergence, which
correspond to functions f = \sum x_i \log x_i - \sum x_i and f = -\sum \log x_i, respectively. The gener-
alized KL-divergence is equal to:
\sum_{i=1}^{n} \left[ x_i \log \frac{x_i}{y_i} - x_i + y_i \right],
while the Itakura-Saito divergence is equal to:
\sum_{i=1}^{n} \left[ \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right].
If vectors x and y are proper probability distributions, then \sum x_i = \sum y_i = 1. In this
case, the generalized KL-divergence becomes the regular KL-divergence:

\sum_{i=1}^{n} x_i \log \frac{x_i}{y_i}.
Computing logarithms is costly: We can considerably improve efficiency of
Itakura-Saito divergence and KL-divergence by pre-computing logarithms at in-
dex time. The spaces that implement this functionality contain the substring
fast in their mnemonic names (see also Table 1).
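
For illustration, a compact NumPy sketch (not NMSLIB's optimized code; the helper names are hypothetical) of both divergences and of the index-time log pre-computation idea:

import numpy as np

def generalized_kl(x, y):
    # sum_i [ x_i log(x_i / y_i) - x_i + y_i ]
    return np.sum(x * np.log(x / y) - x + y)

def itakura_saito(x, y):
    # sum_i [ x_i / y_i - log(x_i / y_i) - 1 ]
    r = x / y
    return np.sum(r - np.log(r) - 1.0)

def generalized_kl_precomp(x, log_x, y, log_y):
    # log_y can be pre-computed at index time for every data vector y, and
    # log_x once per query, removing most logarithm calls at search time.
    return np.sum(x * (log_x - log_y) - x + y)

x = np.random.rand(8); x /= x.sum()
y = np.random.rand(8); y /= y.sum()
print(generalized_kl(x, y), generalized_kl_precomp(x, np.log(x), y, np.log(y)))
print(itakura_saito(x, y))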
3.7 Rényi Divergence
The Rényi divergence is a family of generally non-symmetric distances computed
by the formula:

\frac{1}{\alpha - 1} \log \left[ \sum_{i=1}^{m} x_i^{\alpha} y_i^{1-\alpha} \right]

The value of the parameter α should be greater than zero. For all values except
α = 0.5 these distances are non-symmetric. There are two variants of each
space: renyidiv_slow and renyidiv_fast. Both have a parameter alpha. The
slower variant computes the exponents using the standard function pow. For the
fast variant, as explained in the author's blog, we compute exponents through
square rooting. This works best when the number of binary digits after the binary
point is small, e.g., if α = 0.125. The slow variant can actually be faster if the
compiler, e.g., MS Visual Studio or the Intel Compiler, implements an efficient
approximate variant of the pow function.
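
As a small illustration (not NMSLIB code), the Rényi divergence can be computed directly from the formula above; note the symmetry at α = 0.5:

import numpy as np

def renyi_divergence(x, y, alpha):
    # (1 / (alpha - 1)) * log( sum_i x_i^alpha * y_i^(1 - alpha) )
    return np.log(np.sum(x ** alpha * y ** (1.0 - alpha))) / (alpha - 1.0)

x = np.random.rand(16); x /= x.sum()
y = np.random.rand(16); y /= y.sum()
print(renyi_divergence(x, y, 0.5), renyi_divergence(y, x, 0.5))  # equal: symmetric
print(renyi_divergence(x, y, 2.0), renyi_divergence(y, x, 2.0))  # generally different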
3.8 String Distances
We currently provide implementations for the Levenshtein distance and its length-
normalized variant. The original Levenshtein distance is equal to the minimum
number of insertions, deletions, and substitutions (but not transpositions) re-
quired to obtain one string from another [27]. The distance between strings p
and s is computed using the classic O(m × n) dynamic programming solution,
where m and n are the lengths of strings p and s, respectively. The normalized Lev-
enshtein distance is obtained by dividing the original Levenshtein distance by
the maximum of string lengths. If both strings are empty, the distance is equal
to zero.
While the original Levenshtein distance is a metric distance, the normalized
Levenshtein function is not, because the triangle inequality may not hold. In
practice, when there is little variance in string length, the violation of the tri-
angle inequality is infrequent and, thus, the normalized Levenshtein distance is
approximately metric for many real data sets.
Technically, the classic Levenshtein distance is equal to C_{n,m}, where C_{i,j} is
computed via the classic recursion:

C_{i,j} = \min \begin{cases}
0, & \text{if } i = j = 0 \\
C_{i-1,j} + 1, & \text{if } i > 0 \\
C_{i,j-1} + 1, & \text{if } j > 0 \\
C_{i-1,j-1} + [p_i \neq s_j], & \text{if } i, j > 0
\end{cases}    (5)
Because computation time is proportional to both strings’ length, this can be
a costly operation: for the sample data set described in §3.1, it is possible to
compute only about 200K distances per second.
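
For illustration, a compact Python version of the dynamic program behind Eq. (5), together with the length-normalized variant (NMSLIB's implementation is in C++; this sketch is only illustrative):

def levenshtein(p, s):
    # Classic O(m*n) dynamic programming; only two rows are kept in memory.
    prev = list(range(len(s) + 1))
    for i, pc in enumerate(p, start=1):
        cur = [i] + [0] * len(s)
        for j, sc in enumerate(s, start=1):
            cur[j] = min(prev[j] + 1,               # deletion
                         cur[j - 1] + 1,            # insertion
                         prev[j - 1] + (pc != sc))  # substitution or match
        prev = cur
    return prev[len(s)]

def normalized_levenshtein(p, s):
    return levenshtein(p, s) / max(len(p), len(s)) if (p or s) else 0.0

print(levenshtein("ACGT", "AGT"), normalized_levenshtein("ACGT", "AGT"))  # 1 0.25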
The classic algorithm to compute the Levenshtein distance was independently
discovered by several researchers in various contexts, including speech recogni-
tion [47,46,39] and computational biology [36] (see Sankoff [40] for a historical
perspective). Despite the early discovery, the algorithm was generally unknown
before a publication by Wagner and Fischer [48] in a computer science journal.
3.9 Signature Quadratic Form Distance (SQFD)
Images can be compared using a family of metric functions called the Signature
Quadratic Form Distance (SQFD). During the preprocessing stage, each image
is converted to a set of n signatures (the number of signatures n is a parameter).
To this end, a fixed number of pixels is randomly selected. Then, each pixel
is represented by a 7-dimensional vector with the following components: three
color, two position, and two texture elements. These 7-dimensional vectors are
clustered by the standard k-means algorithm with ncenters. Finally, each cluster
is represented by an 8-dimensional vector, called signature. A signature includes
a 7-dimensional centroid and a cluster weight (the number of cluster points
divided by the total number of randomly selected pixels). Cluster weights form
asignature histogram.
The SQFD is computed as a quadratic form applied to a 2n-dimensional vec-
tor constructed by combining images’ signature histograms. The combination
vector includes n unmodified signature histogram values of the first image fol-
lowed by n negated signature histogram values of the second image. Unlike the
classic quadratic form distance, where the quadratic form matrix is fixed, in the
case of the SQFD, the matrix is re-computed for each pair of images. This can
be seen as computing the distance between infinite-dimensional vectors each of
which has only a finite number of non-zero elements.
To compute the quadratic form matrix, we introduce a new global enu-
meration of signatures, in which signature k from the first image has number
k, while signature k from the second image has number n + k. To obtain
the quadratic form matrix element in row i, column j, we first compute the Eu-
clidean distance d between the i-th and the j-th signature. Then, the value d is
transformed using one of three functions: negation (the minus function -d),
a heuristic function 1/(α + d), or the Gaussian function exp(-α d^2). The larger
the distance, the smaller the coefficient in the matrix of the quadratic form.
Note that the SQFD is a family of distances parameterized by the choice of
the transformation function and α. For further details, please see the thesis of
Beecks [2].
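
As an illustration, the following toy NumPy sketch computes an SQFD with the Gaussian transformation exp(-α d^2); it follows the description above (the weights of the second image are negated) and, following Beecks [2], takes the square root of the quadratic form. It is not NMSLIB's implementation:

import numpy as np

def sqfd_gaussian(centroids1, weights1, centroids2, weights2, alpha=1.0):
    # Combined 2n-dimensional weight vector: the first image's weights
    # followed by the negated weights of the second image.
    w = np.concatenate([weights1, -weights2])
    c = np.concatenate([centroids1, centroids2], axis=0)
    # Quadratic form matrix: Gaussian similarity of pairwise Euclidean distances.
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    a = np.exp(-alpha * d ** 2)
    return np.sqrt(w @ a @ w)

rng = np.random.default_rng(0)
c1, c2 = rng.random((20, 7)), rng.random((20, 7))  # 20 signatures, 7-dim centroids
w1, w2 = np.full(20, 1 / 20), np.full(20, 1 / 20)  # cluster weights sum to one
print(sqfd_gaussian(c1, w1, c2, w2))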
4 Search Methods
Implemented search methods can be broadly divided into the following cate-
gories:
- Space partitioning methods (including a specialized method bbtree for Bregman divergences) §4.1;
- Locality Sensitive Hashing (LSH) methods §4.2;
- Filter-and-refine methods based on projection to a lower-dimensional space §4.3;
- Filtering methods based on permutations §4.4;
- Methods that construct a proximity graph §4.5;
- Miscellaneous methods §4.6.
In the following subsections (§4.1-4.6), we describe implemented methods,
explain their parameters, and provide examples of their use via the benchmarking
utility experiment (experiment.exe on Windows). The details on building the
utility experiment can be found online:
https://github.com/nmslib/nmslib/tree/v1.8/manual/README.md.
4.1 Space Partitioning Methods
Parameters of space partitioning methods are summarized in Table 2. Most of
these methods are hierarchical partitioning methods.
Hierarchical space partitioning methods create a hierarchical decomposition
of the space (often in a recursive fashion), which is best represented by a tree
(or a forest). There are two main partitioning approaches: pivoting and compact
partitioning schemes [14].
Pivoting methods rely on embedding into a vector space where vector ele-
ments are distances from the object to pivots. Partitioning is based on how far
(or close) the data points are located with respect to pivots.[2]
Hierarchical partitions produced by pivoting methods lack locality: a single
partition can contain not-so-close data points. In contrast, compact partitioning
schemes exploit locality. They either divide the data into clusters or create,
possibly approximate, Voronoi partitions. In the latter case, for example, we can
select several centers/pivots π_i and associate data points with the closest center.
If the current partition contains fewer than bucketSize (a method parame-
ter) elements, we stop partitioning of the space and place all elements belonging
to the current partition into a single bucket. If, in addition, the value of the
parameter chunkBucket is set to one, we allocate a new chunk of memory that
contains a copy of all bucket vectors. This method often halves retrieval time at
the expense of extra memory consumed by a testing utility (e.g., experiment)
as it does not deallocate memory occupied by the original vectors.[3]
Classic hierarchical space partitioning methods for metric spaces are exact. It
is possible to make them approximate via an early termination technique, where
we terminate the search after exploring a pre-specified number of partitions. To
implement this strategy, we define an order of visiting partitions. In the case
of clustering methods, we first visit partitions that are closer to a query point.
In the case of hierarchical space partitioning methods such as the VP-tree, we
greedily explore partitions containing the query.
In NMSLIB, the early termination condition is defined in terms of the max-
imum number of buckets (parameter maxLeavesToVisit) to visit before termi-
nating the search procedure. By default, the parameter maxLeavesToVisit is
set to a large number (2147483647), which means that no early termination is
employed. The parameter maxLeavesToVisit is supported by many, but not all
space partitioning methods.
[2] If the original space is metric, mapping an object to a vector of distances to pivots
defines a contractive embedding into the metric space with the L∞ distance. That is,
the L∞ distance in the target vector space is a lower bound for the original distance.
[3] Keeping original vectors simplifies the testing workflow. However, this is not nec-
essary for a real production system. Hence, storing bucket vectors at contiguous
memory locations does not have to result in a larger memory footprint.
4.1.1 VP-tree A VP-tree [45,50] (also known as a ball-tree) is a pivoting
method. During indexing, a (random) pivot is selected and a set of data objects
is divided into two parts based on the distance to the pivot. If the distance
is smaller than the median distance, the objects are placed into one (inner)
partition. If the distance is larger than the median, the objects are placed into
the other (outer) partition. If the distance is exactly equal to the median, the
placement can be arbitrary.
We attempt to select the pivot multiple times. Each time, we measure the
variance of distances to the pivot. Eventually, we use the pivot that corresponds
to the maximum variance. The number of attempts to select the pivot is con-
trolled by the index-time parameter selectPivotAttempts.
The VP-tree in metric spaces is an exact search method, which relies on the
triangle inequality. It can be made approximate by applying the early termina-
tion strategy (as described in the previous subsection). Another approximate-
search approach, which is currently implemented only for the VP-tree, is based
on the relaxed version of the triangle inequality.
Assume that π is the pivot in the VP-tree, q is the query with the radius
r, and R is the median distance from π to every other data point. Due to the
triangle inequality, pruning is possible only if r ≤ |R − d(π, q)|. If this latter
condition is true, we visit only the partition that contains the query point. If
r > |R − d(π, q)|, there is no guarantee that all answers are in the same partition
as q. Thus, to guarantee retrieval of all answers, we need to visit both partitions.
The pruning condition based on the triangle inequality can be overly pes-
simistic. By selecting some α > 1 and opting to prune when r ≤ α|R − d(π, q)|,
we can improve search performance at the expense of missing some valid an-
swers. The efficiency-effectiveness trade-off is affected by the choice of α: Note
that for some (especially low-dimensional) data sets, a modest loss in recall (e.g.,
by 1-5%) can lead to an order of magnitude faster retrieval. Not only can the triangle
inequality be overly pessimistic in metric spaces, it also often fails to capture
the geometry of non-metric spaces. As a result, if a metric-space method is
applied to a non-metric space, the recall can be too low or the retrieval time can be
too long.
Yet, in non-metric spaces, it is often possible to answer queries when using an α
possibly smaller than one [6,34]. More generally, we assume that there exists an
unknown decision/pruning function D(R, d(π, q)) and that pruning is done when
r ≤ D(R, d(π, q)). The decision function D(), which can be learned from data,
is called a search oracle. A pruning algorithm based on the triangle inequality is
a special case of the search oracle described by the formula:

D_{\pi,R}(x) = \begin{cases} \alpha_{left} \, |x - R|^{exp_{left}}, & \text{if } x \le R \\ \alpha_{right} \, |x - R|^{exp_{right}}, & \text{if } x > R \end{cases}    (6)
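
For illustration, a small Python sketch of the pruning rule of Eq. (6) applied in a single VP-tree node (the helper name is hypothetical; this is not NMSLIB's code):

def partitions_to_visit(r, d_pivot_query, R, alpha_left=1.0, alpha_right=1.0,
                        exp_left=1, exp_right=1):
    # Prune one subtree when the query ball of radius r stays on one side
    # of the (stretched) median boundary R.
    x = d_pivot_query
    if x <= R:
        boundary = alpha_left * abs(x - R) ** exp_left
        return ["inner"] if r <= boundary else ["inner", "outer"]
    boundary = alpha_right * abs(x - R) ** exp_right
    return ["outer"] if r <= boundary else ["inner", "outer"]

# alpha > 1 prunes more aggressively: faster search, possibly lower recall.
print(partitions_to_visit(r=0.3, d_pivot_query=1.0, R=1.2))                 # both partitions
print(partitions_to_visit(r=0.3, d_pivot_query=1.0, R=1.2, alpha_left=2.0)) # inner only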
There are several ways to obtain/specify optimal parameters for the VP-tree:
- using the auto-tuning procedure fired before creation of the index;
- using the standalone tuning utility tune_vptree (tune_vptree.exe for Windows);
- fully manually.
It is, perhaps, easiest to initiate the tuning procedure during creation of
the index. To this end, one needs to specify parameters desiredRecall (the
minimum desired recall), bucketSize (the size of the bucket), tuneK or tuneR,
and (optionally) parameters tuneQty, minExp, and maxExp. Parameters tuneK
and tuneR are used to specify the value of k for the k-NN search, or the search
radius r for the range search.
The parameter tuneQty defines the maximum number of records in a subset
that is used for tuning. The tuning procedure will sample tuneQty records from
the main set to make a (potentially) smaller test data set. Additional query sets will
be created by further random sampling of points from this smaller data set.
The tuning procedure considers all possible values for the exponents between
minExp and maxExp with the restriction that expLeft = expRight. By default, minExp
= maxExp = 1, which is usually a good setting. For each value of the exponents,
the tuning procedure carries out a grid-like search for parameters
α_left and α_right with several random restarts. It creates several indices for the
tuning subset and runs a batch of mini-experiments to find parameters yield-
ing the desired recall value at the minimum cost. If it is necessary to produce
more accurate estimates, the tuning method may use automatically adjusted
values for parameters tuneQty, bucketSize, and desiredRecall. The tuning
algorithm cannot adjust the parameter maxLeavesToVisit: please do not use
it with the auto-tuning procedure.
The disadvantage of automatic tuning is that it might fail to obtain param-
eters for a desired recall level. Another limitation is that the tuning procedure
cannot run on very small data sets (fewer than two thousand entries).
The standalone tuning utility tune_vptree exploits an almost identical tun-
ing procedure. It differs from index-time auto-tuning in several ways:
- It can be used with other VP-tree based methods, in particular, with the
  projection VP-tree (see §4.3.2).
- It allows the user to specify a separate query set, which can be useful when
  queries cannot be accurately modelled by a bootstrapping approach (sampling
  queries from the main data set).
- Once the optimal values are computed, they can be further re-used without
  the need to start the tuning procedure each time the index is created.
  However, the user is fully responsible for specifying the size of the test data
  set and the value of the parameter desiredRecall: the system will not try
  to change them for optimization purposes.
If automatic tuning fails, the user can restart the procedure with a smaller
value of desiredRecall. Alternatively, the user can manually specify values of
the parameters alphaLeft, alphaRight, expLeft, and expRight (by default the
exponents are one).
The following is an example of testing the VP-tree with the benchmarking
utility experiment without auto-tuning (note the separation into index-
and query-time parameters):
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method vptree \
--createIndex bucketSize=10,chunkBucket=1 \
--queryTimeParams alphaLeft=2.0,alphaRight=2.0,\
expLeft=1,expRight=1,\
maxLeavesToVisit=500
To initiate auto-tuning, one may use the following command line (note that
we do not use the parameter maxLeavesToVisit here):
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method vptree \
--createIndex tuneK=1,desiredRecall=0.9,\
bucketSize=10,chunkBucket=1
4.1.2 Multi-Vantage Point Tree It is possible to have more than one pivot
per tree level. In the binary version of the multi-vantage point tree (MVP-tree),
which is implemented in NMSLIB, there are two pivots. Thus, each partition
divides the space into four parts, which are similar to partitions created by two
levels of the VP-tree. The difference is that the VP-tree employs three pivots to
divide the space into four parts, while in the MVP-tree two pivots are used.
In addition, in the MVP-tree we memorize distances between a data object
and the first maxPathLen (method parameter) pivots on the path connecting the
root and the leaf that stores this data object. Because mapping an object to a
vector of distances (to maxPathLen pivots) defines a contractive embedding into
the metric space with the L∞ distance, these values can be used to improve the
filtering capacity of the MVP-tree and, consequently, to reduce the number of
distance computations.
The following is an example of testing the MVP-tree with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method mvptree \
--createIndex maxPathLen=4,bucketSize=10,chunkBucket=1 \
--queryTimeParams maxLeavesToVisit=500
Our implementation of the MVP-tree permits answering queries both exactly
and approximately (by specifying the parameter maxLeavesToVisit). Yet, this
implementation should be used only with metric spaces.
Table 2: Parameters of space partitioning methods

Common parameters
  bucketSize            The maximum number of elements in a bucket/leaf.
  chunkBucket           Indicates if bucket elements should be stored contiguously in memory (1 by default).
  maxLeavesToVisit      An early termination parameter equal to the maximum number of buckets (tree leaves) visited by a search algorithm (2147483647 by default).

VP-tree (vptree) [45,50]
  Common parameters: bucketSize, chunkBucket, and maxLeavesToVisit
  selectPivotAttempts   The number of pivot selection attempts (5 by default).
  alphaLeft/alphaRight  The stretching coefficient α_left/α_right in Eq. (6).
  expLeft/expRight      The left/right exponent in Eq. (6).
  tuneK                 The value of k used in the auto-tuning procedure (in the case of k-NN search).
  tuneR                 The value of the radius r used in the auto-tuning procedure (in the case of the range search).
  minExp/maxExp         The minimum/maximum value of the exponent used in the auto-tuning procedure.

Multi-Vantage Point Tree (mvptree) [7]
  Common parameters: bucketSize, chunkBucket, and maxLeavesToVisit
  maxPathLen            The maximum number of top-level pivots for which we memorize distances to data objects in the leaves.

GH-tree (ghtree) [45]
  Common parameters: bucketSize, chunkBucket, and maxLeavesToVisit

List of clusters (list_clusters) [13]
  Common parameters: bucketSize, chunkBucket, and maxLeavesToVisit. Note that maxLeavesToVisit is a query-time parameter.
  useBucketSize         If equal to one, we use the parameter bucketSize to determine the number of points in the cluster. Otherwise, the size of the cluster is defined by the parameter radius.
  radius                The maximum radius of a cluster (used when useBucketSize is set to zero).
  strategy              A cluster selection strategy: one of random, closestPrevCenter, farthestPrevCenter, minSumDistPrevCenters, maxSumDistPrevCenters.

SA-tree (satree) [35]
  No parameters

bbtree (bbtree) [10]
  Common parameters: bucketSize, chunkBucket, and maxLeavesToVisit

Note: mnemonic method names are given in round brackets.
4.1.3 GH-Tree A GH-tree [45] is a binary tree. In each node the data set
is divided using two randomly selected pivots. Elements closer to one pivot are
placed into a left subtree, while elements closer to the second pivot are placed
into a right subtree.
The following is an example of testing the GH-tree with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method ghtree \
--createIndex bucketSize=10,chunkBucket=1 \
--queryTimeParams maxLeavesToVisit=10
Our implementation of the GH-tree permits answering queries both exactly
and approximately (by specifying the parameter maxLeavesToVisit). Yet, this
implementation should be used only with metric spaces.
4.1.4 List of Clusters The list of clusters [13] is an exact search method for
metric spaces, which relies on flat (i.e., non-hierarchical) clustering. Clusters are
created sequentially starting by randomly selecting the first cluster center. Then,
close points are assigned to the cluster and the clustering procedure is applied
to the remaining points. Closeness is defined either in terms of the maximum
radius, or in terms of the maximum number (bucketSize) of points closest to
the center.
Next we select cluster centers according to one of the policies: random selec-
tion, a point closest to the previous center, a point farthest from the previous
center, a point that minimizes the sum of distances to the previous center, and
a point that maximizes the sum of distances to the previous center. In our ex-
perience, a random selection strategy (a default one) works well in most cases.
The search algorithm iterates over the constructed list of clusters and checks
if answers can potentially belong to the currently selected cluster (using the
triangle inequality). If the cluster can contain an answer, each cluster element
is compared directly against the query. Next, we use the triangle inequality to
verify if answers can be outside the current cluster. If this is not possible, the
search is terminated.
We modified this exact algorithm by introducing an early termination condi-
tion. The clusters are visited in the order of increasing distance from the query to
a cluster center. The search process stops after visiting maxLeavesToVisit clus-
ters. Our version is supposed to work for metric spaces (and symmetric distance
functions), but it can also be used with mildly-nonmetric symmetric distances
such as the cosine distance.
An example of testing the list of clusters using the bucketSize as a parameter
to define the size of the cluster:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method list_clusters \
--createIndex useBucketSize=1,bucketSize=100,strategy=random \
--queryTimeParams maxLeavesToVisit=5
An example of testing the list of clusters using the radius as a parameter to
define the size of the cluster:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method list_clusters \
--createIndex useBucketSize=0,radius=0.2,strategy=random \
--queryTimeParams maxLeavesToVisit=5
4.1.5 SA-tree The Spatial Approximation tree (SA-tree) [35] aims to approx-
imate the Voronoi partitioning. A data set is recursively divided by selecting
several cluster centers in a greedy fashion. Then, all remaining data points are
assigned to the closest cluster center.
A cluster-selection procedure first randomly chooses the main center point
and arranges the remaining objects in the order of increasing distances to this
center. It then iteratively fills the set of clusters as follows: We start from the
empty cluster list. Then, we iterate over the set of data points and check if there
is a cluster center that is closer to this point than the main center point. If no
such cluster exists (i.e., the point is closer to the main center point than to any
of the already selected cluster centers), the point becomes a new cluster center
(and is added to the list of clusters). Otherwise, the point is added to the nearest
cluster from the list.
After the cluster centers are selected, each of them is indexed recursively
using the already described algorithm. Before this, however, we check if there
are points that need to be reassigned to a different cluster. Indeed, because the
list of clusters keeps growing, we may miss the nearest cluster not yet added to
the list. To fix this, we need to compute distances between every cluster point and the
cluster centers that were not selected at the moment of the point’s assignment
to the cluster.
Currently, the SA-tree is an exact search method for metric spaces without
any parameters. The following is an example of testing the SA-tree with the
benchmarking utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method satree
4.1.6 bbtree A Bregman ball tree (bbtree) is an exact search method for
Bregman divergences [10]. The bbtree divides data into two clusters (each cov-
ered by a Bregman ball) and recursively repeats this procedure for each cluster
until the number of data points in a cluster falls below bucketSize. Then, such
clusters are stored as a single bucket.
At search time, the method relies on properties of Bregman divergences to
compute the shortest distance to a covering ball. This is a rather expensive
iterative procedure that may require several computations of direct and inverse
gradients, as well as of several distances.
Additionally, Cayton [10] employed an early termination method: The algo-
rithm can be told to stop after processing maxLeavesToVisit buckets. The
resulting method is an approximate search procedure.
Our implementation of the bbtree uses the same code to carry out the nearest-
neighbor and the range searching. Such an implementation of the range searching
is somewhat suboptimal and a better approach exists [11].
The following is an example of testing the bbtree with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType kldivgenfast \
--testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method bbtree \
--createIndex bucketSize=10 \
--queryTimeParams maxLeavesToVisit=20
4.2 Locality-sensitive Hashing Methods
Locality Sensitive Hashing (LSH) [25,26] is a class of methods employing hash
functions that tend to have the same hash values for close points and different
hash values for distant points. It is a probabilistic method in which the probabil-
ity of having the same hash value is a monotonically decreasing function of the
distance between two points (that we compare). A hash function that possesses
this property is called locality sensitive.
Our library embeds the LSHKIT which provides locality sensitive hash func-
tions in L1 and L2. It supports only the nearest-neighbor (but not the range)
search. Parameters of LSH methods are summarized in Table 3. The LSH meth-
ods are not available under Windows.
Random projection is a common approach to designing locality sensitive hash
functions. These functions are composed from M binary hash functions h_i(x). A
concatenation of the binary hash function values, i.e., h_1(x)h_2(x)...h_M(x), is
interpreted as a binary representation of the hash function value h(x). Pointers
Table 3: Parameters of LSH methods

Common parameters
  W                  A width of the window [16].
  M                  A number of atomic (binary) hash functions, which are concatenated to produce an integer hash value.
  H                  A size of the hash table.
  L                  The number of hash tables.

Multiprobe LSH: only for L2 (lsh_multiprobe) [30,18,16]
  Common parameters: W, M, H, and L
  T                  A number of probes.
  desiredRecall      A desired recall.
  numSamplePairs     A number of samples (parameter P in LSHKIT).
  numSampleQueries   A number of sample queries (parameter Q in LSHKIT).
  tuneK              Find optimal parameters for the k-NN search, where k is defined by this parameter.

LSH Gaussian: only for L2 (lsh_gaussian) [12]
  Common parameters: W, M, H, and L

LSH Cauchy: only for L1 (lsh_cauchy) [12]
  Common parameters: W, M, H, and L

LSH thresholding: only for L1 (lsh_threshold) [49,29]
  Common parameters: M, H, and L (W is not used)

Note: mnemonic method names are given in round brackets.
to objects with equal hash values (modulo H) are stored in the same cells of the
hash table (of size H). If we used only one hash table, the probability of
collision for two similar objects would be too low. To increase the probability of
finding a similar object, multiple hash tables are used, each with its own
(randomly selected) hash function.
To generate binary hash functions, we first select a parameter W (called the
width). Next, for every binary hash function, we draw a vector a_i whose components
come from a p-stable distribution [15], and a value b_i from the uniform distribution with
support [0, W]. Finally, we define h_i(x) as:

h_i(x) = \left\lfloor \frac{a_i \cdot x + b_i}{W} \right\rfloor,
where ⌊x⌋ is the floor function and x · y denotes the scalar product of x and y.
For L2, the standard Gaussian distribution is p-stable, while for the L1 distance
one can generate hash functions using a Cauchy distribution [15]. For L1, the
LSHKIT defines another (“thresholding”) approach based on sampling. It is
supposed to work best for data points enclosed in a cube [a, b]^d. We omit the
description here and refer the reader to the papers that introduced this method
[49,29].
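
For illustration, a toy Python sketch of the hash-function construction described above for L2 (Gaussian, i.e., 2-stable, projections); the class name is hypothetical and LSHKIT's production code is different:

import numpy as np

class GaussianLSH:
    def __init__(self, dim, M, W, rng):
        self.W = W
        self.a = rng.standard_normal((M, dim))  # the Gaussian distribution is 2-stable
        self.b = rng.uniform(0.0, W, size=M)

    def hash(self, x):
        # M atomic hashes h_i(x) = floor((a_i . x + b_i) / W), combined into
        # a single integer hash value.
        h = np.floor((self.a @ x + self.b) / self.W).astype(np.int64)
        return hash(tuple(h))

rng = np.random.default_rng(0)
lsh = GaussianLSH(dim=128, M=10, W=4.0, rng=rng)
x = rng.random(128)
print(lsh.hash(x) == lsh.hash(x + 1e-6))  # a tiny perturbation usually hashes identically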
One serious drawback of the LSH is that it is memory-greedy. To reduce
the number of hash tables while keeping the collision probability for similar
objects sufficiently high, it was proposed to “multi-probe” the same hash table
more than once. When we obtain the hash value h(x), we check (i.e., probe) not
only the contents of the hash table cell h(x) mod H, but also contents of cells
whose binary codes are “close” to h(x) (i.e., they may differ by a small number of
bits). The LSHKIT, which is embedded in our library, contains a state-of-the-art
implementation of the multi-probe LSH that can automatically select optimal
values for parameters M and W to achieve a desired recall (remaining parameters
still need to be chosen manually).
The following is an example of testing the multi-probe LSH with the bench-
marking utility experiment. We aim to achieve the recall value 0.25 (parameter
desiredRecall) for the 1-NN search (parameter tuneK):
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_multiprobe \
--createIndex desiredRecall=0.25,tuneK=1,\
T=5,L=25,H=16535
The classic version of the LSH for L2can be tested as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_gaussian \
--createIndex W=2,L=5,M=40,H=16535
There are two ways to use LSH for L1. First, we can invoke the implemen-
tation based on the Cauchy distribution:
release/experiment \
--distType float --spaceType l1 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_cauchy \
--createIndex W=2,L=5,M=10,H=16535
Second, we can use the L1 implementation based on thresholding. Note that it
does not use the width parameter W:
release/experiment \
--distType float --spaceType l1 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_threshold \
--createIndex L=5,M=60,H=16535
4.3 Projection-based Filter-and-Refine Methods
Projection-based filter-and-refine methods operate by mapping data and query
points to a low(er) dimensional space (a projection space) with a simple, easy
to compute, distance function. The search procedure consists of generating
candidate entries by searching in the low-dimensional projection space, with sub-
sequent refinement, where candidate entries are directly compared against the
query using the original distance function.
The number of candidate records is an important method parameter, which
can be specified as a fraction of the total number of database entries (parameter
dbScanFrac).
Different projection-based methods arise depending on the type of projection,
the type of the projection space, and the type of the search algorithm
for the projection space. The type of projection can be specified via a method's
parameter projType. A dimensionality of the projection space is specified via a
method’s parameter projDim.
We support four well-known types of projections:
- Classic random projections using random orthonormal vectors (mnemonic name rand);
- Fastmap (mnemonic name fastmap);
- Distances to random reference points/pivots (mnemonic name randrefpt);
- Projections based on permutations (mnemonic name perm).
All but the classic random projections are distance-based and can be applied to
an arbitrary space with the distance function. Random projections can be applied
only to vector spaces. A more detailed description of projection approaches is
given in §A.
We provide two basic implementations to generate candidates. One is based
on brute-force searching in the projected space and another builds a VP-tree
over objects’ projections. In what follows, these methods are described in detail.
4.3.1 Brute-force projection search. In the brute-force approach, we scan
the list of projections and compute the distance between the projected query and
a projection of every data point. Then, we sort all data points in the order of
increasing distance to the projected query. A fraction (defined by dbScanFrac)
of data points is compared directly against the query. Top candidates (the closest
entries) are identified using either a priority queue or incremental sorting [24].
Incremental sorting is a more efficient approach and is enabled by default. The
mnemonic code of this method is proj_incsort.
A choice of the distance in the projected space is governed by the parameter
useCosine. If it is set to 1, the cosine distance is used (this makes most sense if we
use the cosine distance in the original space). By default, useCosine = 0, which
forces the use of L2 in the projected space.
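
For illustration, a schematic filter-and-refine search over random projections (illustrative only; the name filter_and_refine is hypothetical and this is not NMSLIB's code):

import numpy as np

def filter_and_refine(data, query, orig_dist, proj, db_scan_frac=0.01, k=1):
    # Filtering step: rank all points by L2 distance in the projected space.
    proj_data = data @ proj
    proj_query = query @ proj
    cand_qty = max(k, int(db_scan_frac * len(data)))
    order = np.argsort(np.linalg.norm(proj_data - proj_query, axis=1))
    # Refinement step: compare a dbScanFrac fraction of points directly
    # against the query using the original distance.
    candidates = order[:cand_qty]
    return sorted(candidates, key=lambda i: orig_dist(data[i], query))[:k]

rng = np.random.default_rng(0)
data = rng.random((10_000, 64)).astype(np.float32)
query = rng.random(64).astype(np.float32)
proj, _ = np.linalg.qr(rng.standard_normal((64, 4)))  # 4 random orthonormal directions
print(filter_and_refine(data, query, lambda a, b: np.linalg.norm(a - b), proj))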
The following is an example of testing the brute-force search of projections
with the benchmarking utility experiment:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method proj_incsort \
--createIndex projType=rand,projDim=4 \
--queryTimeParams useCosine=1,dbScanFrac=0.01
4.3.2 Projection VP-tree. To avoid exhaustive search in the space of pro-
jections, it is possible to index projected vectors using a VP-tree. The method’s
mnemonic name is proj_vptree. One needs to specify both the parameters
of the VP-tree (see §4.1.1) and the projection parameters, as in the case
of brute-force searching of projections (see §4.3.1).
The major difference from the brute-force search over projections is that,
instead of choosing between L2 and the cosine distance as the distance in the pro-
jected space, one uses a method's parameter projSpaceType to specify an ar-
bitrary one. Similar to the regular VP-tree implementation, optimal α_left and
α_right are determined by the utility tune_vptree via a grid-search-like procedure
(tune_vptree.exe on Windows).
This method, unfortunately, tends to perform worse than the VP-tree ap-
plied to the original space. The only exceptions are spaces with high intrinsic
(and, perhaps, representational) dimensionality, where VP-trees (even with an
approximate search algorithm) are useless unless dimensionality is reduced sub-
stantially. One example is Wikipedia tf-idf vectors; see:
https://github.com/nmslib/nmslib/tree/v1.8/manual/datasets.md.
The following is an example of testing the VP-tree over projections with the
benchmarking utility experiment:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method proj_vptree \
--createIndex projType=rand,projDim=4,projSpaceType=cosinesimil \
--queryTimeParams alphaLeft=2,alphaRight=2,dbScanFrac=0.01
4.3.3 OMEDRANK. In OMEDRANK [21] there is a small set of voting
pivots, each of which ranks data points based on a somewhat imperfect notion
of the distance from points to the query (computed by a classic random pro-
jection or a projection of some different kind). While each individual ranking is
imperfect, a more accurate ranking can be achieved by rank aggregation. When
such a consolidating ranking is found, the most highly ranked objects from this
aggregate ranking can be used as answers to a nearest-neighbor query. Finding
the aggregate ranking is an NP-complete problem that Fagin et al. [21] solve
only heuristically.
Technically, at index time, each point in the original space is pro-
jected into a lower-dimensional vector space. The dimensionality of the projec-
tion is defined using a method’s parameter numPivot (note that this is different
from other projection methods). Then, for each dimension i in the projected
space, we sort data points in the order of increasing value of the i-th element of
their projections.
We also divide the index into chunks, each accounting for at most chunkIndexSize
data points. The search algorithm processes one chunk at a time. The idea is to
make a chunk sufficiently small so that auxiliary data structures fit into L1 or
L2 cache.
The retrieval algorithm uses numPivot pointers low_i and numPivot pointers
high_i (low_i ≤ high_i). The i-th pair of pointers (low_i, high_i) indicates a start
and an end position in the i-th list. For each data point, we allocate a zero-
initialized counter. We further create a projection of the query and use numPivot
binary searches to find, for each i, the data point whose i-th projection
coordinate is closest to the query's. In the i-th list, we make both high_i and low_i point to the found
data entry. In addition, for each data point found, we increase its counter. Note
that a single data point may appear the closest with respect to more than one
projection coordinate!
After that, we run a series of iterations. In each iteration, we increase the numPivot
pointers high_i and decrease the numPivot pointers low_i (unless we have reached the be-
ginning or the end of a list). For each data entry at which a pointer points,
we increase the value of its counter. Obviously, when we complete traversal of
all numPivot lists, each counter will have the value numPivot (recall that each
data point appears exactly once in each of the lists). Thus, sooner or later the
value of a counter becomes equal to or larger than numPivot × minFreq, where
minFreq is a method's parameter, e.g., 0.5.
The first point whose counter becomes equal to or larger than numPivot ×
minFreq becomes the first candidate entry to be compared directly against the
query. The next point whose counter matches the threshold value numPivot ×
minFreq becomes the second candidate, and so on and so forth. The total num-
ber of candidate entries is defined by the parameter dbScanFrac. Instead of
all numPivot lists, it is possible to use only the numPivotSearch lists that cor-
respond to the smallest absolute values of the query's projection coordinates. In
this case, the counter threshold is numPivotSearch × minFreq. By default,
numPivot = numPivotSearch.
Note that parameters numPivotSearch and dbScanFrac were introduced by
us, they were not employed in the original version of OMEDRANK.
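The following is a minimal Python sketch of the counter-and-pointer retrieval described above. It is only an illustration, not the library's C++ implementation: the function names and the use of numpy are ours, and ties as well as chunking (chunkIndexSize) are ignored.

import bisect
import numpy as np

def omedrank_candidates(proj_data, proj_query, num_candidates, min_freq=0.5):
    # proj_data: (N, numPivot) matrix of projected data points;
    # proj_query: projected query of length numPivot.
    n, num_pivot = proj_data.shape
    lists = [np.argsort(proj_data[:, i]) for i in range(num_pivot)]  # ids sorted by the i-th coordinate
    coords = [proj_data[lists[i], i] for i in range(num_pivot)]
    counters = np.zeros(n, dtype=int)
    threshold = num_pivot * min_freq
    candidates, cand_set = [], set()

    def visit(point_id):
        counters[point_id] += 1
        if counters[point_id] >= threshold and point_id not in cand_set:
            cand_set.add(point_id)
            candidates.append(point_id)

    # Position low_i = high_i at the entry closest to the query's i-th coordinate.
    low, high = [], []
    for i in range(num_pivot):
        pos = min(bisect.bisect_left(coords[i], proj_query[i]), n - 1)
        low.append(pos)
        high.append(pos)
        visit(lists[i][pos])

    # Move the pointers outwards, incrementing counters, until enough candidates are found.
    while len(candidates) < num_candidates:
        moved = False
        for i in range(num_pivot):
            if low[i] > 0:
                low[i] -= 1
                visit(lists[i][low[i]])
                moved = True
            if high[i] < n - 1:
                high[i] += 1
                visit(lists[i][high[i]])
                moved = True
        if not moved:
            break
    return candidates[:num_candidates]  # to be compared directly against the query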
The following is an example of testing OMEDRANK with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method omedrank \
--createIndex projType=rand,numPivot=8 \
--queryTimeParams minFreq=0.5,dbScanFrac=0.02
4.4 Permutation-based Filtering Methods
Rather than relying on distance values directly, we can assess the similarity of objects
based on their relative distances to reference points (i.e., pivots). For each data
point x, we can arrange pivots π_i in the order of increasing distance from x
(for simplicity we assume that there are no ties). This arrangement is called a
permutation. The permutation is essentially a pivot ranking. Technically, it is a
vector whose i-th element keeps the (ordinal) position of the i-th pivot (in the
set of pivots sorted by their distance from x).
Computation of the permutation is a mapping from a source space, which
may not have coordinates, to a target vector space with integer coordinates. In
our library, the distance between permutations is defined as either L1 or L2.
Values of the distance in the source space often correlate well with the distance
in the target space of permutations. This property is exploited in permutation
methods. An advantage of permutation methods is that they do not rely
on metric properties of the original distance and can be successfully applied to
non-metric spaces [6,34].
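As an illustration, the following Python snippet (ours, not the library's code; numpy is assumed) computes a pivot ranking and the L2 distance between two such rankings.

import numpy as np

def permutation(x, pivots, dist):
    # The i-th element is the ordinal position of pivot i when all pivots
    # are sorted by their distance from x (0 = closest pivot).
    d = np.array([dist(x, p) for p in pivots])
    order = np.argsort(d)
    perm = np.empty(len(pivots), dtype=int)
    perm[order] = np.arange(len(pivots))
    return perm

def l2(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

rng = np.random.default_rng(0)
pivots = rng.normal(size=(8, 4))
x, y = rng.normal(size=4), rng.normal(size=4)
# The distance between permutations serves as a proxy for the original distance.
print(l2(permutation(x, pivots, l2), permutation(y, pivots, l2)))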
Note that there is no simple relationship between the distance in the target
space and the distance in the source space. In particular, the distance in the
target space is neither a lower nor an upper bound for the distance in the source
space. Thus, methods based on indexing permutations are filtering methods that
allow us to obtain only approximate solutions. In the first step, we retrieve a
certain number of candidate points whose permutations are sufficiently close to
the permutation of the query vector. For these candidate data points, we compute
an actual distance to the query, using the original distance function. For almost
all implemented permutation methods, the number of candidate objects can be
controlled by a parameter dbScanFrac or minCandidate.
Permutation methods differ in how they index and process permutations. In
the following subsections, we briefly review implemented variants. Parameters
of these methods are summarized in Tables 5-6.
4.4.1 Brute-force permutation search. In the brute-force approach, we
scan the list of permutations and compute the distance between the
permutation of the query and the permutation of every data point.
Table 4: Parameters of projection-based filter-and-refine methods

Common parameters:
  projType: A type of projection.
  projDim: Dimensionality of projection vectors.
  intermDim: An intermediate dimensionality used to reduce dimensionality via the hashing trick (used only for sparse vector spaces).
  dbScanFrac: A number of candidate records obtained during the filtering step.

Brute-force Projection Search (proj_incsort):
  Common parameters: projType, projDim, intermDim, dbScanFrac.
  useCosine: If set to one, we use the cosine distance in the projected space. By default (value zero), L2 is used.
  useQueue: If set to one, we use a priority queue instead of incremental sorting. The default is zero.

Projection VP-tree (proj_vptree):
  Common parameters: projType, projDim, intermDim, dbScanFrac.
  projSpaceType: Type of the space of projections.
  bucketSize: A maximum number of elements in a bucket/leaf.
  chunkBucket: Indicates if bucket elements should be stored contiguously in memory (1 by default).
  maxLeavesToVisit: An early termination parameter equal to the maximum number of buckets (tree leaves) visited by a search algorithm (2147483647 by default).
  alphaLeft/alphaRight: A stretching coefficient α_left/α_right in Eq. (6).
  expLeft/expRight: The left/right exponent in Eq. (6).

OMEDRANK [21] (omedrank):
  Common parameters: projType, intermDim, dbScanFrac.
  numPivot: Projection dimensionality.
  numPivotSearch: Number of data point lists to be used in search.
  minFreq: The threshold for being considered a candidate entry: whenever a point's counter reaches numPivotSearch × minFreq, this point is compared directly to the query.
  chunkIndexSize: A number of documents in one index chunk.

Note: mnemonic method names are given in round brackets.
Then, we sort all data points in the order of increasing distance to the query permutation,
and a fraction (dbScanFrac) of the data points is compared directly against the query.
In the current version of the library, the brute-force search over regular per-
mutations is a special case of the brute-force search over projections (see §4.3.1),
where the projection type is perm. There is also an additional brute-force filter-
ing method, which relies on the so-called binarized permutations. It is described
in §4.4.6.
4.4.2 Permutation Prefix Index (PP-Index). In a permutation prefix
index (PP-index), permutations are stored in a prefix tree of limited depth [20].
A parameter prefixLength defines the depth. The filtering phase aims to find
minCandidate candidate data points. To this end, it first retrieves the data
points whose prefix of the inverse pivot ranking is exactly the same as that of
the query. If we do not get enough candidate objects, we shorten the prefix and
repeat the procedure until we get a sufficient number of candidate entries. Note
that we do not use the parameter dbScanFrac here.
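A rough Python sketch of this filtering step is shown below (our illustration only; a dictionary stands in for the prefix tree, and the helper names are hypothetical).

def build_pp_index(permutations, prefix_length):
    # Map a prefix (the prefix_length closest pivots, closest first) to object ids.
    index = {}
    for obj_id, perm in enumerate(permutations):
        closest_first = sorted(range(len(perm)), key=lambda p: perm[p])
        index.setdefault(tuple(closest_first[:prefix_length]), []).append(obj_id)
    return index

def pp_candidates(index, query_perm, prefix_length, min_candidate):
    closest_first = sorted(range(len(query_perm)), key=lambda p: query_perm[p])
    # Shorten the prefix until at least min_candidate points are retrieved.
    for plen in range(prefix_length, 0, -1):
        prefix = tuple(closest_first[:plen])
        cands = [o for key, objs in index.items() if key[:plen] == prefix for o in objs]
        if len(cands) >= min_candidate:
            return cands
    return [o for objs in index.values() for o in objs]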
The following is an example of testing the PP-index with the benchmarking
utility experiment.
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method pp-index \
--createIndex numPivot=4 \
--queryTimeParams prefixLength=4,minCandidate=100
4.4.3 VP-tree index over permutations. We can use a VP-tree to index
permutations. This approach is similar to that of Figueroa and Fredriksson [23].
We, however, rely on the approximate version of the VP-tree described in §4.1.1,
while Figueroa and Fredriksson use an exact one. The “sloppiness” of the VP-
tree search is governed by the stretching coefficients alphaLeft and alphaRight
as well as by the exponents in Eq. (6). In NMSLIB, the VP-tree index over
permutations is a special case of the projection VP-tree (see §4.3.2). There is
also an additional VP-tree based method that indexes binarized permutations.
It is described in §4.4.6.
4.4.4 Metric Inverted File (MI-File). The MI-file relies on an inverted index over
permutations [1]. We select a (potentially large) subset of pivots (parameter
numPivot). Using these pivots, we compute a permutation for every data point.
Then, the numPivotIndex closest pivots are memorized in the index. If pivot
number i occupies position pos in the permutation of object x, we add the
pair (pos, x) to posting list number i. All posting lists are kept sorted in the
order of the increasing first element (equal to the ordinal position of the pivot
in a permutation).
During searching, we compute the permutation of the query and select posting
lists corresponding to the numPivotSearch closest pivots. These posting
lists are processed as follows: imagine that we selected posting list i and that the
position of pivot i in the permutation of the query is pos. Then, using posting
list i, we retrieve all candidate records for which the position of pivot i in
their respective permutations is between pos - maxPosDiff and pos + maxPosDiff.
This allows us to update the estimate of the L1 distance between the retrieved
candidate records' permutations and the permutation of the query (see [1] for more
details).
Finally, we select at most dbScanFrac · N objects (N is the total number of
indexed objects) with the smallest estimates of the L1 distance between their
permutations and the permutation of the query. These objects are compared directly
against the query. The filtering step of the MI-file is expensive. Therefore, this
method is efficient only for computationally intensive distances.
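The sketch below (ours, and deliberately simplified) shows the index layout and candidate selection; the real MI-File scoring also penalizes pivots that a candidate and the query do not share, which we omit here (see [1]).

from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_mi_file(permutations, num_pivot_index):
    # posting[i] holds (pos, obj_id) pairs, sorted by pos, for every object
    # whose permutation places pivot i among its num_pivot_index closest pivots.
    posting = defaultdict(list)
    for obj_id, perm in enumerate(permutations):
        for pivot, pos in enumerate(perm):
            if pos < num_pivot_index:
                posting[pivot].append((pos, obj_id))
    for lst in posting.values():
        lst.sort()
    return posting

def mi_file_candidates(posting, query_perm, num_pivot_search, max_pos_diff, db_scan_frac, n):
    closest = sorted(range(len(query_perm)), key=lambda p: query_perm[p])[:num_pivot_search]
    estimate = defaultdict(int)  # partial L1 estimate accumulated over shared indexed pivots
    for pivot in closest:
        qpos = query_perm[pivot]
        lst = posting.get(pivot, [])
        lo = bisect_left(lst, (qpos - max_pos_diff, -1))
        hi = bisect_right(lst, (qpos + max_pos_diff, n))
        for pos, obj_id in lst[lo:hi]:
            estimate[obj_id] += abs(pos - qpos)
    k = max(1, int(db_scan_frac * n))
    return sorted(estimate, key=estimate.get)[:k]  # refined with the real distance afterwards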
An example of testing this method using the utility experiment is as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method mi-file \
--createIndex numPivot=128,numPivotIndex=16 \
--queryTimeParams numPivotSearch=4,dbScanFrac=0.01
4.4.5 Neighborhood APProximation Index (NAPP). It was recently
proposed to index pivot neighborhoods: for each data point, we compute
distances to numPivot pivots and select the numPivotIndex (typically, much smaller
than numPivot) pivots that are closest to the data point. Then, we associate
these numPivotIndex closest pivots with the data point via an inverted file [43].
One can hope that the pivot neighborhoods of two similar points will have a
non-empty intersection.
To exploit this observation, our implementation of the pivot neighborhood
indexing method retrieves all points that share at least numPivotSearch nearest-
neighbor pivots with the query (using an inverted file). Then, these candidate points
can be compared directly against the query, which works well for cheap distances
such as L2.
For computationally expensive distances, one can add an additional filtering
step by setting the parameter useSort to one. If useSort is one, all candidate
entries are additionally sorted by the number of shared pivots (in decreasing
order). Afterwards, a subset of the candidates is compared directly against the
query. The size of the subset is defined by the parameter dbScanFrac. When
selecting the subset, we give priority to candidates sharing more pivots
with the query. This secondary filtering may eliminate less promising entries,
but it incurs additional computational costs, which may outweigh the benefits
of the additional filtering "power" if the distance is cheap.
In many cases, good performance can be achieved by selecting pivots ran-
domly. However, we find that pivots can also be engineered (more information
on this topic will be published soon). To load external pivots, the user should
specify an index-time parameter pivotFile. The pivots should be in the same
format as the data points.
Note that our implementation is different from that of Tellez et al. [43] in several
ways. First, we do not use a succinct inverted index. Second, we use a simple
posting-list merging algorithm based on counting (the ScanCount algorithm). Before
a query is processed, we zero-initialize an array that keeps one counter for every
data point. As we traverse a posting list and encounter an entry corresponding
to object i, we increment counter number i. ScanCount is known to be
quite efficient [28].
We also divide the index into chunks, each accounting for at most chunkIndexSize
data points. The search algorithm processes one chunk at a time. The idea is to
make a chunk sufficiently small so that the counters fit into the L1 or L2 cache.
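A small Python illustration of the ScanCount merging step follows (our sketch, not the library's chunked C++ code; here the query simply uses its num_pivot_index closest pivots, which may differ in detail from the actual implementation).

import numpy as np

def build_napp(permutations, num_pivot, num_pivot_index):
    # posting[p] lists ids of points that have pivot p among their closest pivots.
    posting = [[] for _ in range(num_pivot)]
    for obj_id, perm in enumerate(permutations):
        for pivot, pos in enumerate(perm):
            if pos < num_pivot_index:
                posting[pivot].append(obj_id)
    return posting

def napp_candidates(posting, query_perm, num_pivot_index, num_pivot_search, n):
    counters = np.zeros(n, dtype=np.int32)  # zero-initialized before every query
    query_pivots = sorted(range(len(query_perm)), key=lambda p: query_perm[p])[:num_pivot_index]
    for pivot in query_pivots:
        for obj_id in posting[pivot]:
            counters[obj_id] += 1            # ScanCount: one increment per shared pivot
    # Keep points sharing at least num_pivot_search nearest-neighbor pivots with the query.
    return np.nonzero(counters >= num_pivot_search)[0]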
An example of running NAPP without the additional filtering stage:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--cachePrefixGS napp_gold_standard \
--method napp \
--createIndex numPivot=32,numPivotIndex=8,chunkIndexSize=1024 \
--queryTimeParams numPivotSearch=8 \
--saveIndex napp_index
Note that NAPP is capable of saving/loading the meta index. However, in
the bootstrapping mode this is only possible if gold standard data is cached
(hence, the option --cachePrefixGS).
An example of running NAPP with the additional filtering stage:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method napp \
--createIndex numPivot=32,numPivotIndex=8,chunkIndexSize=1024 \
--queryTimeParams useSort=1,dbScanFrac=0.01,numPivotSearch=8
4.4.6 Binarized permutation methods. Instead of computing the L2 distance
between two permutations, we can binarize the permutations and compute
the Hamming distance between the binarized permutations. To this end, we select
an ad hoc binarization threshold binThreshold (the number of pivots divided
by two is usually a good setting). All integer values smaller than binThreshold
become zeros, and values larger than or equal to binThreshold become ones.
The Hamming distance between binarized permutations can be computed
much faster than L2 or L1 (see Table 1). This comes at a cost, though, as the
Hamming distance appears to be a worse proxy for the original distance than L2
or L1 (for the same number of pivots). One can compensate for the loss in quality
by using more pivots. In our experiments, it was usually sufficient to double the
number of pivots.
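For illustration, binarization and the Hamming distance can be sketched in a few lines of Python (ours; the library uses optimized bit-level code).

import numpy as np

def binarize(perm, bin_threshold):
    # Positions below the threshold become 0, the remaining ones become 1.
    return (np.asarray(perm) >= bin_threshold).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# With numPivot=32 pivots, binThreshold=16 splits each permutation in half.
perm_x, perm_y = np.random.permutation(32), np.random.permutation(32)
print(hamming(binarize(perm_x, 16), binarize(perm_y, 16)))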
Binarized permutations can be searched sequentially. An example of testing
such a method using the utility experiment is as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method perm_incsort_bin \
--createIndex numPivot=32,binThreshold=16 \
--queryTimeParams dbScanFrac=0.05
Alternatively, binarized permutations can be indexed using the VP-tree. This
approach is usually more efficient than searching binarized permutations sequen-
tially, but one needs to tune additional parameters. An example of testing such
a method using the utility experiment is as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method perm_bin_vptree \
--createIndex numPivot=32 \
--queryTimeParams alphaLeft=2,alphaRight=2,dbScanFrac=0.05
4.5 Proximity/Neighborhood Graphs
One efficient and effective search approach relies on building a graph, where data
points are graph nodes and edges connect sufficiently close points. When edges
connect nearest-neighbor points, such a graph is called a k-NN graph (or a nearest-
neighbor graph).
In a proximity graph, a search is a series of greedy sub-searches. A
sub-search starts at some (e.g., random) node and proceeds by expanding the
set of traversed nodes in a best-first fashion, following neighboring links. The
algorithm resembles Dijkstra's shortest-path algorithm in that, in each step,
it selects an unvisited point closest to the query.
Multiple stopping heuristics have been proposed. For example, we can
stop after visiting a certain number of nodes. In NMSLIB, the sub-search terminates
essentially when the candidate queue is exhausted. Specifically, the candidate
queue is expanded only with points that are closer to the query than
the k'-th closest point already discovered by the sub-search (k' is a search
parameter). When we stop finding such points, the queue dwindles and eventually
becomes empty. We also terminate the sub-search when the queue contains only
points farther than the k'-th closest point already discovered by the sub-search.
Note that the greedy search is only approximate and does not necessarily return
all true nearest neighbors.
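The following Python sketch (our simplification, with k' written as the parameter ef, cf. efSearch) shows one such greedy best-first sub-search over an adjacency-list graph.

import heapq

def greedy_sub_search(graph, data, query, dist, start, k, ef):
    # graph[v] is the list of v's neighbors.
    visited = {start}
    d0 = dist(data[start], query)
    candidates = [(d0, start)]   # min-heap of points still to be expanded
    result = [(-d0, start)]      # max-heap (negated) of the ef closest points found so far
    while candidates:
        d, v = heapq.heappop(candidates)
        if len(result) >= ef and d > -result[0][0]:
            break                # only points farther than the ef-th closest remain
        for u in graph[v]:
            if u in visited:
                continue
            visited.add(u)
            du = dist(data[u], query)
            # Expand the queue only with points closer than the ef-th closest point seen so far.
            if len(result) < ef or du < -result[0][0]:
                heapq.heappush(candidates, (du, u))
                heapq.heappush(result, (-du, u))
                if len(result) > ef:
                    heapq.heappop(result)
    return [v for _, v in sorted((-d, v) for d, v in result)[:k]]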
Table 5: Parameters of permutation-based filtering methods

Common parameters:
  numPivot: A number of pivots.
  dbScanFrac: A number of candidate records obtained during the filtering step. It is specified as a fraction (not a percentage!) of the total number of data points in the data set.
  binThreshold: Binarization threshold. If a value of an original permutation vector is below this threshold, it becomes 0 in the binarized permutation; if the value is greater than or equal to the threshold, it is converted to 1.

Permutation Prefix Index (pp-index) [20]:
  numPivot: A number of pivots.
  minCandidate: A minimum number of candidates to retrieve (note that we do not use dbScanFrac here).
  prefixLength: A maximum length of the tree prefix that is used to retrieve candidate records.
  chunkBucket: 1 if we want to store vectors having the same permutation prefix in the same memory chunk (i.e., contiguously in memory).

Metric Inverted File (mi-file) [1]:
  Common parameters: numPivot and dbScanFrac.
  numPivotIndex: A number of (closest) pivots to index.
  numPivotSearch: A number of (closest) pivots to use during searching.
  maxPosDiff: The maximum position difference permitted when searching in the inverted file.

Neighborhood Approximation Index (napp) [43]:
  Common parameter: numPivot.
  invProcAlg: An algorithm to merge posting lists. In practice, only scan worked well.
  chunkIndexSize: A number of documents in one index chunk.
  indexThreadQty: A number of indexing threads.
  numPivotIndex: A number of closest pivots to be indexed.
  numPivotSearch: A candidate entry should share this number of pivots with the query. This is a query-time parameter.

Note: mnemonic method names are given in round brackets.
Table 6: Parameters of permutation-based filtering methods (continued)

Brute-force search with incremental sorting for binarized permutations (perm_incsort_bin) [42]:
  Common parameters: numPivot, dbScanFrac, binThreshold.

VP-tree index over binarized permutations (perm_bin_vptree):
  Similar to [42], but uses an approximate search in the VP-tree.
  Common parameters: numPivot, dbScanFrac, binThreshold. Note that dbScanFrac is a query-time parameter.
  alphaLeft: A stretching coefficient α_left in Eq. (6).
  alphaRight: A stretching coefficient α_right in Eq. (6).

Note: mnemonic method names are given in round brackets.
In our library, we use several approaches to create proximity graphs, which
are described below. Parameters of these methods are summarized in Table 7.
Note that SW-graph and NN-descent have a parameter with the same name,
namely NN. However, this parameter has a somewhat different interpretation
depending on the method. Also note that our proximity-graph methods support
only the nearest-neighbor search, but not the range search.
4.5.1 Small World Graph (SW-graph). In the (Navigable) Small World
graph (SW-graph),4 indexing is a bottom-up procedure that relies on the previously
described greedy search algorithm. The number of restarts, though, is
defined by a different parameter, namely initIndexAttempts. We insert points one
by one. For each data point, we find the NN closest points using the already
constructed index. Then, we create an undirected edge between the new graph node
(representing the new point) and the nodes that represent the NN closest points found by
the greedy search. Each sub-search starts from some (e.g., random) node and
proceeds by expanding the candidate queue with points that are closer than the
efConstruction-th closest point (efConstruction is an index-time parameter).
Similarly, the search procedure executes one or more sub-searches that
start from some node. The queue is expanded only with points that are closer
than the efSearch-th closest point. The number of sub-searches is defined by
the parameter initSearchAttempts.
Empirically, it was shown that this method often creates a navigable small
world graph, where most nodes are separated by only a few edges (roughly
logarithmic in terms of the overall number of objects) [31]. A simpler and less
efficient variant of this algorithm was presented at ICTA 2011 and SISAP 2012
[37,31]. An improved variant appeared as an Information Systems publication
[32]. In the latter paper, however, the values of efSearch and efConstruction
are set equal to NN. The idea of using values of efSearch and efConstruction
potentially (much) larger than NN was proposed by Malkov and Yashunin [33].

4 SW-graph is also known as a Metrized Small-World (MSW) graph and a Navigable
Small World (NSW) graph.
The indexing algorithm is rather expensive and we accelerate it by running
parallel searches in multiple threads. The number of threads is defined by the
parameter indexThreadQty. By default, this parameter is equal to the number
of virtual cores. The graph updates are synchronized: If a thread needs to add
edges to a node or obtain the list of node edges, it first locks a node-specific
mutex. Because different threads rarely update and/or access the same node
simultaneously, such synchronization creates little contention and, consequently,
our parallelization approach is efficient. It is also necessary to synchronize up-
dates for the list of graph nodes, but this operation takes little time compared
to searching for NN neighboring points.
An example of testing this method using the utility experiment is as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--cachePrefixGS sw-graph \
--method sw-graph \
--createIndex NN=3,initIndexAttempts=5,indexThreadQty=4 \
--queryTimeParams initSearchAttempts=1,efSearch=10 \
--saveIndex sw-graph_index
Note that SW-graph is capable of saving/loading the meta index. However,
in the bootstrapping mode this is only possible if gold standard data is cached
(hence, the option --cachePrefixGS).
4.5.2 Hierarchical Navigable Small World Graph (HNSW). The Hierarchical
Navigable Small World Graph (HNSW) [33] is a new search method,
a successor of the SW-graph. HNSW can be much faster (especially during indexing)
and is more robust. However, the current implementation is still experimental
and we will update it in the near future.
HNSW can be seen as a multi-layer and multi-resolution variant of a proximity
graph. The ground (zero-level) layer includes all data points. The higher
the layer, the fewer points it has. When a data point is added to HNSW, we
select its maximum level m randomly; the probability of selecting level
m decreases exponentially with m.
Similarly to the SW-graph, the HNSW graph is constructed by inserting data
points one by one. A new point is added to all layers starting from layer m down
to layer zero. This is done using a search-based algorithm similar to that of the
basic SW-graph. The quality is controlled by the parameter efConstruction.
Specifically, a search starts from the maximum-level layer and proceeds to
lower layers by searching one layer at a time. For all layers higher than the
ground layer, the search algorithm is a 1-NN search that greedily follows the
closest neighbor (this is equivalent to having efConstruction=1). The closest
point found at layer h+1 is used as the starting point for the search carried out
at layer h. For the ground layer, we carry out an M-NN search whose quality is
controlled by the parameter efConstruction. Note that the ground-layer search
relies on the same algorithm as we use for the SW-graph (yet, we use only a
single sub-search, which is equivalent to setting initIndexAttempts to one).
The outcome of a search in a layer is a set of data points that are close to the
new point. Using one of the heuristics described by Malkov and Yashunin [33],
we select points from this set to become neighbors of the new point (in the layer's
graph). Note that, unlike the older SW-graph, the new algorithm has a limit on
the maximum number of neighbors. If the limit is exceeded, the heuristics are
used to keep only the best neighbors. Specifically, the maximum number of neighbors
in all layers but the ground layer is maxM (an index-time parameter, which
is equal to M by default). The maximum number of neighbors for the ground
layer is maxM0 (an index-time parameter, which is equal to 2×M by default). The
choice of the heuristic is controlled by the parameter delaunay_type.
By default, delaunay_type is equal to 2. This default is generally quite
good. However, it may be worth trying the other viable values: 0, 1, and 3.
The search algorithm is similar to the indexing algorithm. It starts from the
maximum-level layer and proceeds to lower-level layers by searching one layer
at a time. For all layers higher than the ground layer, the search algorithm is
a 1-NN search that greedily follows the closest neighbor (this is equivalent to
having efSearch=1). The closest point found at layer h+1 is used as the
starting point for the search carried out at layer h. For the ground layer, we
carry out a k-NN search whose quality is controlled by the parameter efSearch (in
the paper by Malkov and Yashunin [33] this parameter is denoted as ef). The
ground-layer search relies on the same algorithm as we use for the SW-graph,
but it does not carry out multiple sub-searches starting from different random
data points.
For L2 and the cosine similarity, HNSW has optimized implementations,
which are enabled by default. To enforce the use of the generic algorithm, set
the parameter skip_optimized_index to one.
Similarly to the SW-graph, the indexing algorithm can be expensive. It is, therefore,
accelerated by running parallel searches in multiple threads. The number of
threads is defined by the parameter indexThreadQty. By default, this parameter
is equal to the number of virtual cores.
A sample command line to test HNSW using the utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method hnsw \
--createIndex M=10,efConstruction=20,indexThreadQty=4,searchMethod=0 \
--queryTimeParams efSearch=10
HNSW is capable of saving an index for the optimized L2 and cosine-similarity
implementations. Here is an example for the cosine similarity:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--cachePrefixGS hnsw \
--method hnsw \
--createIndex M=10,efConstruction=20,indexThreadQty=4,searchMethod=4 \
--queryTimeParams efSearch=10 \
--saveIndex hnsw_index
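For completeness, here is a minimal sketch of using HNSW through the Python bindings mentioned in §1 (it assumes the nmslib Python package is installed; the parameter names correspond to the index-time and query-time parameters discussed above):

import numpy as np
import nmslib

data = np.random.randn(10000, 8).astype(np.float32)

index = nmslib.init(method='hnsw', space='l2')
index.addDataPointBatch(data)
index.createIndex({'M': 10, 'efConstruction': 20, 'indexThreadQty': 4})
index.setQueryTimeParams({'efSearch': 10})

ids, dists = index.knnQuery(data[0], k=1)   # 1-NN query for the first data point
print(ids, dists)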
4.5.3 NN-Descent. NN-descent is an iterative procedure initialized with
randomly selected nearest neighbors. In each iteration, a random sample of
queries is selected to participate in neighborhood propagation.
This process is governed by the parameters rho and delta. The parameter rho defines
the fraction of the data set that is randomly sampled for neighborhood propagation.
A good value that works in many cases is rho=0.5. As the indexing
algorithm iterates, fewer and fewer neighborhoods change (when we attempt to
improve the local neighborhood structure via neighborhood propagation). The
parameter delta defines a stopping condition in terms of the fraction of modified
edges in the k-NN graph (the exact definition can be inferred from the code). A
good default value is delta=0.001. The indexing algorithm is multi-threaded:
the method uses all available cores.
When NN-descent was incorporated into NMSLIB, there was no open-source
search algorithm released, only the code to construct a k-NN graph. Therefore,
we use the same search algorithm as for the SW-graph [31,32]. The new, open-source
version of NN-descent (code-named kgraph), which does include a search
algorithm, can be found on GitHub.
Here is an example of testing this method using the utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method nndes \
--createIndex NN=10,rho=0.5,delta=0.001 \
--queryTimeParams initSearchAttempts=3
4.6 Miscellaneous Methods
Currently our major miscellaneous methods do not have parameters.
4.6.1 Brute-force (sequential) searching. To verify how the speed of
brute-force searching scales with the number of threads, we provide a refer-
ence implementation of the sequential searching. For example, to benchmark
sequential searching using two threads, one can type the following command:
Table 7: Parameters of proximity-graph based methods

Common parameters:
  efSearch: The search depth: specifically, a sub-search is stopped when it cannot find a point closer than the efSearch closest points seen so far.

SW-graph (sw-graph) [37,31,32]:
  NN: For a newly added point, find this number of closest points that make up the initial neighborhood of the point. When more points are added, this neighborhood may be expanded.
  efConstruction: The depth of the search that is used to find neighbors during indexing. This parameter is analogous to efSearch.
  initIndexAttempts: The number of random search restarts carried out to add one point.
  indexThreadQty: The number of indexing threads. The default value is equal to the number of (logical) CPU cores.
  initSearchAttempts: A number of random search restarts.

Hierarchical Navigable SW-graph (hnsw) [33]:
  mult: A scaling coefficient that determines the depth of the layered structure (see the paper by Malkov and Yashunin [33] for details). The default value seems to be good enough.
  skip_optimized_index: Setting this parameter to one disables the use of the optimized implementations (for L2 and the cosine similarity).
  maxM: The maximum number of neighbors in all layers but the ground layer (the default value seems to be good enough).
  maxM0: The maximum number of neighbors in the ground layer (the default value seems to be good enough).
  M: The size of the initial set of potential neighbors for the indexing phase. The set may be further pruned so that the overall number of neighbors does not exceed maxM0 (for the ground layer) or maxM (for all other layers).
  efConstruction: The depth of the search that is used to find neighbors during indexing (this parameter is used only for the search in the ground layer).
  delaunay_type: A type of the pruning heuristic: 0 indicates that we keep only maxM (or maxM0 for the ground layer) neighbors, 1 activates the heuristic described by Algorithm 4 of [33].

NN-descent (nndes) [17,31,32]:
  NN: For each point, find this number of closest points (neighbors).
  rho: A fraction of the data set that is randomly sampled for neighborhood propagation.
  delta: A stopping condition in terms of the fraction of updated edges in the k-NN graph.
  initSearchAttempts: A number of random search restarts.

Note: mnemonic method names are given in round brackets.
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method seq_search --threadTestQty 2
4.6.2 Term-Based Uncompressed Inverted File. For sparse data sets
(especially when queries are much sparser than documents), it can be useful to
employ an inverted file. The inverted file is a mapping from a set of dimensions
to their respective posting lists. The posting list of a dimension is a sorted list of
documents/vectors in which this dimension is non-zero. To answer queries, we use
a classic document-at-a-time (DAAT) processing algorithm [44] with a priority
queue (see Figure 5.3 in [9]); our implementation is a part of NMSLIB. This algorithm
is also implemented in Lucene 6.0 (and earlier versions). However, our implementation
is faster than Lucene's (1.5-2× faster on compr and stack) for several reasons: our
implementation does not use compression, it does not explicitly compute BM25, and
it is written in C++ rather than Java. Here is an example:
release/experiment \
--distType float --spaceType negdotprod_sparse_fast --testSetQty 5 \
--maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/sparse_wiki_5K.txt --outFilePrefix result \
--method simple_invindx --threadTestQty 2
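The following Python sketch (ours; uncompressed posting lists, no BM25, hypothetical helper names) illustrates a DAAT traversal with a heap of posting-list cursors for a sparse dot-product space.

import heapq
from collections import defaultdict

def build_inverted_index(docs):
    # docs[i] is a dict {dim: value}; posting lists are ordered by document id.
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for dim, val in doc.items():
            postings[dim].append((doc_id, val))
    return postings

def daat_knn(postings, query, k):
    lists = [(postings[dim], qval) for dim, qval in query.items() if dim in postings]
    heap = [(plist[0][0], li, 0) for li, (plist, _) in enumerate(lists)]
    heapq.heapify(heap)          # cursors ordered by the current document id
    top = []                     # min-heap keeping the k largest dot products
    while heap:
        doc_id, score = heap[0][0], 0.0
        while heap and heap[0][0] == doc_id:   # document-at-a-time: consume all cursors on doc_id
            _, li, pos = heapq.heappop(heap)
            plist, qval = lists[li]
            score += plist[pos][1] * qval
            if pos + 1 < len(plist):
                heapq.heappush(heap, (plist[pos + 1][0], li, pos + 1))
        heapq.heappush(top, (score, doc_id))
        if len(top) > k:
            heapq.heappop(top)
    # For negative-dot-product spaces, the nearest neighbors are the documents
    # with the largest dot product.
    return sorted(top, reverse=True)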
5 Notes on Efficiency
5.1 Efficiency of Distance Functions
Note that improvements in efficiency and in the number of distance computations
obtained with slow distance functions can be overly optimistic. That is, when
a slow distance function is replaced with a more efficient version, the improvements
over sequential search may become far less impressive. In some cases, the
search method can become even slower than the brute-force comparison against
every data point. This is why we believe that optimizing the computation of a
distance function is as important as (and sometimes even more important than)
designing better search methods.
In this library, we optimized several distance functions, especially non-metric
functions that involve computation of logarithms. An order-of-magnitude improvement
can be achieved by pre-computing logarithms at index time and by
approximating those logarithms that cannot be pre-computed (see §3.5
and §3.6 for more details). Yet, this doubles the size of the index.
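The idea behind pre-computing logarithms can be illustrated with a few lines of Python (our sketch, not the library's SIMD code).

import numpy as np

def preprocess(vectors):
    # Store each vector together with its element-wise logarithm
    # (this is what roughly doubles the index size).
    v = np.asarray(vectors, dtype=np.float64)
    return v, np.log(v)

def kl_div(x, log_x, y, log_y):
    # KL(x || y) = sum_i x_i * (log x_i - log y_i): no log() calls at query time.
    return float(np.sum(x * (log_x - log_y)))

data, log_data = preprocess([[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]])
print(kl_div(data[0], log_data[0], data[1], log_data[1]))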
The Intel compiler has a powerful math library, which allows one to compute
several hard distance functions (such as the KL-divergence, the Jensen-Shannon
divergence/metric, and the Lp spaces for non-integer values of p) more
efficiently than in the case of GNU C++ and Clang. In Visual Studio's fast
math mode (which is enabled in the provided project files), it is also possible to
compute some hard distances several times faster compared to GNU C++ and
Clang. Yet, our custom implementations are often much faster. For example, in
the case of the Intel compiler, the custom implementation of the KL-divergence
is 10 times faster than the standard one, while the custom implementation of the
JS-divergence is two times faster. In the case of Visual Studio, the custom
KL-divergence is 7 times as fast as the standard one, while the custom JS-divergence
is 10 times faster. Therefore, doubling the size of the data set by
storing pre-computed logarithms seems to be worthwhile.
Efficient implementations of some other distance functions rely on SIMD
instructions. These instructions, available on most modern Intel and AMD processors,
operate on small vectors. Some C++ implementations can be efficiently
vectorized by both the GNU and Intel compilers. That is, instead of scalar
operations, the compiler generates more efficient SIMD instructions. Yet,
the code is not always vectorized, e.g., by Clang. And even the Intel compiler
fails to efficiently vectorize the computation of the KL-divergence (with pre-
computed logarithms).
There are also situations when efficient automatic vectorization is hardly possible.
For instance, we provide an efficient implementation of the scalar product
for sparse single-precision floating-point vectors. It relies on the all-against-all
comparison SIMD instruction _mm_cmpistrm. However, it requires keeping the
data in a special format, which makes automatic vectorization impossible.
Intel SSE extensions that provide SIMD instructions are automatically detected
by all compilers but Visual Studio. If some SSE extensions are not
available, the compilation process will produce warnings like the following one:
LInfNormSIMD: SSE2 is not available, defaulting to pure C++ implementation!
5.2 Cache-friendly Data Layout
In our previous report [5], we underestimated the cost of a random memory access.
A more careful analysis showed that, on a recent laptop (Core i7, DDR3), a truly
random access "costs" about 200 CPU cycles, which may be 2-3 times longer
than the computation of a cheap distance such as L2.
Many implemented methods use some form of bucketing. For example, in
the VP-tree or the bbtree, we recursively decompose the space until a partition contains
at most bucketSize elements. The buckets are searched sequentially, which
could be done much faster if bucket objects were stored in contiguous memory
regions: to check the elements in a bucket, we would then need only one random
memory access.
A number of methods support this optimized storage model. It is activated
by setting a parameter chunkBucket to 1. If chunkBucket is set to 1, indexing
is carried out in two stages. At the first stage, a method creates unoptimized
buckets, each of which is an array of pointers to data objects. Thus, objects are
not necessarily contiguous in memory. In the second stage, the method iterates
over buckets, allocates a contiguous chunk of memory, which is sufficiently large
to keep all bucket objects, and copies bucket objects to this new chunk.
Important note: currently we do not delete the old objects and do
not deallocate the memory they occupy. Thus, if chunkBucket is set to 1, the
memory usage is larger than necessary. In the future, we plan to address this issue.
References
1. G. Amato and P. Savino. Approximate similarity search in metric spaces using in-
verted files. In Proceedings of the 3rd international conference on Scalable informa-
tion systems, page 28. ICST (Institute for Computer Sciences, Social-Informatics
and Telecommunications Engineering), 2008.
2. C. Beecks. Distance based similarity models for content based multimedia retrieval.
PhD thesis, 2013.
3. C. Beecks, M. S. Uysal, and T. Seidl. Signature quadratic form distance. In
Proceedings of the ACM International Conference on Image and Video Retrieval,
CIVR ’10, pages 438–445, New York, NY, USA, 2010. ACM.
4. E. Bingham and H. Mannila. Random projection in dimensionality reduction:
applications to image and text data. In Proceedings of the seventh ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 245–250.
ACM, 2001.
5. L. Boytsov and B. Naidan. Engineering efficient and effective
Non-Metric Space Library. In N. Brisaboa, O. Pedreira, and P. Zezula, ed-
itors, Similarity Search and Applications, volume 8199 of Lecture Notes in
Computer Science, pages 280–293. Springer Berlin Heidelberg, 2013.
6. L. Boytsov and B. Naidan. Learning to prune in metric and non-metric spaces. In
Advances in Neural Information Processing Systems, 2013.
7. T. Bozkaya and M. Ozsoyoglu. Indexing large metric spaces for similarity search
queries. ACM Transactions on Database Systems (TODS), 24(3):361–404, 1999.
8. L. Bregman. The relaxation method of finding the common point of convex sets
and its application to the solution of problems in convex programming. USSR
Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.
9. S. Büttcher, C. L. A. Clarke, and G. V. Cormack. Information Retrieval - Imple-
menting and Evaluating Search Engines. MIT Press, 2010.
10. L. Cayton. Fast nearest neighbor retrieval for bregman divergences. In Proceedings
of the 25th international conference on Machine learning, ICML ’08, pages 112–
119, New York, NY, USA, 2008. ACM.
11. L. Cayton. Efficient bregman range search. In Advances in Neural Information
Processing Systems, pages 243–251, 2009.
12. M. S. Charikar. Similarity estimation techniques from rounding algorithms. In
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing,
pages 380–388. ACM, 2002.
13. E. Chávez and G. Navarro. A compact space decomposition for effective metric
indexing. Pattern Recognition Letters, 26(9):1363–1376, 2005.
14. E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquin. Searching in metric
spaces. ACM Computing Surveys, 33(3):273–321, 2001.
15. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing
scheme based on p-stable distributions. In Proceedings of the twentieth annual
symposium on Computational geometry, pages 253–262. ACM, 2004.
16. W. Dong. High-Dimensional Similarity Search for Large Datasets. PhD thesis,
Princeton University, 2011.
17. W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for
generic similarity measures. In Proceedings of the 20th international conference on
World wide web, pages 577–586. ACM, 2011.
18. W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li. Modeling lsh for
performance tuning. In Proceedings of the 17th ACM conference on Information
and knowledge management, CIKM ’08, pages 669–678, New York, NY, USA, 2008.
ACM.
19. D. M. Endres and J. E. Schindelin. A new metric for probability distributions.
Information Theory, IEEE Transactions on, 49(7):1858–1860, 2003.
20. A. Esuli. Use of permutation prefixes for efficient and scalable approximate simi-
larity search. Inf. Process. Manage., 48(5):889–902, Sept. 2012.
21. R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification
via rank aggregation. In Proceedings of the 2003 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’03, pages 301–312, New York,
NY, USA, 2003. ACM.
22. C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining
and visualization of traditional and multimedia datasets, volume 24. ACM, 1995.
23. K. Figueroa and K. Fredriksson. Speeding up permutation based indexing with
indexing. In Proceedings of the 2009 Second International Workshop on Similarity
Search and Applications, pages 107–114. IEEE Computer Society, 2009.
24. E. Gonzalez, K. Figueroa, and G. Navarro. Effective proximity retrieval by ordering
permutations. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
30(9):1647–1658, 2008.
25. P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the
curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on
Theory of computing, pages 604–613. ACM, 1998.
26. E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate
nearest neighbor in high dimensional spaces. In Proceedings of the 30th annual
ACM symposium on Theory of computing, STOC ’98, pages 614–623, New York,
NY, USA, 1998. ACM.
27. V. Levenshtein. Binary codes capable of correcting deletions, insertions, and re-
versals. Doklady Akademii Nauk SSSR,, 163(4):845–848, 1966.
28. C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate
string searches. In Data Engineering, 2008. ICDE 2008. IEEE 24th International
Conference on, pages 257–266. IEEE, 2008.
29. Q. Lv, M. Charikar, and K. Li. Image similarity search with compact data struc-
tures. In Proceedings of the thirteenth ACM international conference on Informa-
tion and knowledge management, pages 208–217. ACM, 2004.
30. Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe lsh: efficient
indexing for high-dimensional similarity search. In Proceedings of the 33rd inter-
national conference on Very large data bases, pages 950–961. VLDB Endowment,
2007.
31. Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Scalable distributed al-
gorithm for approximate nearest neighbor search problem in high dimensional gen-
eral metric spaces. In Similarity Search and Applications, pages 132–147. Springer,
2012.
32. Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest
neighbor algorithm based on navigable small world graphs. Inf. Syst., 45:61–68,
2014.
33. Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest
neighbor search using Hierarchical Navigable Small World graphs. ArXiv e-prints,
Mar. 2016.
34. B. Naidan, L. Boytsov, and E. Nyberg. Permutation search methods are efficient,
yet faster search is possible. PVLDB, 8(12):1618–1629, 2015.
35. G. Navarro. Searching in metric spaces by spatial approximation. The VLDB
Journal, 11(1):28–46, 2002.
36. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3):443–453,
March 1970.
37. A. Ponomarenko, Y. Malkov, A. Logvinov, and V. Krylov. Approximate nearest
neighbor search small world approach, 2011. Available at http://www.iiis.org/
CDs2011/CD2011IDI/ICTA_2011/Abstract.asp?myurl=CT175ON.pdf.
38. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-
thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale
Visual Recognition Challenge, 2014.
39. H. Sakoe and S. Chiba. A dynamic programming approach to continuous speech
recognition. In Proceedings of the Seventh International Congress on Acoustics,
pages 65–68, August 1971. paper 20C13.
40. D. Sankoff. The early introduction of dynamic programming into computational
biology. Bioinformatics, 16(1):41–47, 2000.
41. B. Schlegel, T. Willhalm, and W. Lehner. Fast sorted-set intersection using simd
instructions. In ADMS@ VLDB, pages 1–8, 2011.
42. E. S. Téllez, E. Chávez, and A. Camarena-Ibarrola. A brief index for proximity
searching. In Progress in Pattern Recognition, Image Analysis, Computer Vision,
and Applications, pages 529–536. Springer, 2009.
43. E. S. Tellez, E. Chávez, and G. Navarro. Succinct nearest neighbor search. Infor-
mation Systems, 38(7):1019–1030, 2013.
44. H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf.
Process. Manage., 31(6):831–850, 1995.
45. J. Uhlmann. Satisfying general proximity similarity queries with metric trees.
Information Processing Letters, 40:175–179, 1991.
46. V. Velichko and N. Zagoruyko. Automatic recognition of 200 words. International
Journal of Man-Machine Studies, 2(3):223 – 234, 1970.
47. T. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics,
4(1):52–57, 1968.
48. R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal
of the ACM, 21(1):168–173, 1974.
49. Z. Wang, W. Dong, W. Josephson, Q. Lv, M. Charikar, and K. Li. Sizing sketches:
a rank-based analysis for similarity search. ACM SIGMETRICS Performance
Evaluation Review, 35(1):157–168, 2007.
50. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in
general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium
on Discrete Algorithms, SODA ’93, pages 311–321, Philadelphia, PA, USA, 1993.
Society for Industrial and Applied Mathematics.
A Description of Projection Types
A.1 The classic random projections
The classic random projections work only for vector spaces (both sparse and
dense). At index time, we generate projDim vectors by sampling their elements
from the standard normal distribution N(0,1) and orthonormalizing them.5
Coordinates in the projected space are obtained by computing scalar products
between a given vector and each of the projDim randomly generated vectors.
In the case of sparse vector spaces, the dimensionality is first reduced via
the hashing trick: the value of element i is equal to the sum of the values of all
elements whose indices are hashed into the number i. After hashing, the classic random
projections are applied. The dimensionality of the intermediate space is defined
by the method's parameter intermDim.
The hashing trick is used purely for efficiency reasons. However, for large
enough values of the intermediate dimensionality, it has virtually no adverse
effect on performance. For example, in the case of Wikipedia tf-idf vectors (see
https://github.com/nmslib/nmslib/tree/v1.8/manual/datasets.md), it is
safe to use the value intermDim=4096.
Random projections work best if both the source and the target space are
Euclidean and the distance is either L2 or the cosine distance. In this case,
there are theoretical guarantees that the projection preserves distances in
the original space well (see, e.g., [4]).
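A small numpy sketch of this construction (ours; it assumes projDim does not exceed the source dimensionality, the other case being covered by the footnote):

import numpy as np

def random_projection_matrix(src_dim, proj_dim, seed=0):
    # projDim vectors with N(0,1) entries, orthonormalized here via a QR decomposition.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((src_dim, proj_dim)))
    return Q                      # columns are orthonormal random directions

def project(x, Q):
    # Projected coordinates are scalar products with the random vectors.
    return np.asarray(x) @ Q

Q = random_projection_matrix(src_dim=128, proj_dim=4)
print(project(np.random.randn(128), Q))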
A.2 FastMap
FastMap, introduced by Faloutsos and Lin [22], is also a type of random-projection
method. At indexing time, we randomly select projDim pairs of points A_i and
B_i. The i-th coordinate of vector x is computed using the formula:

FastMap_i(x) = ( d(A_i, x)^2 - d(B_i, x)^2 + d(A_i, B_i)^2 ) / ( 2 d(A_i, B_i) )    (7)

Given points A and B in the Euclidean space, Eq. (7) gives the length of the
orthogonal projection of x onto the line connecting A and B. However, FastMap
can be used in non-Euclidean spaces as well.
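A short Python sketch of the projection defined by Eq. (7) (our illustration; the function names are hypothetical):

import numpy as np

def fastmap_coord(x, a, b, dist):
    # One FastMap coordinate: the length of the projection of x onto the line through (a, b).
    d_ab = dist(a, b)
    return (dist(a, x) ** 2 - dist(b, x) ** 2 + d_ab ** 2) / (2.0 * d_ab)

def fastmap_project(x, pairs, dist):
    return np.array([fastmap_coord(x, a, b, dist) for a, b in pairs])

l2 = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))
pairs = [(np.random.randn(16), np.random.randn(16)) for _ in range(4)]
print(fastmap_project(np.random.randn(16), pairs, l2))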
A.3 Distances to the Random Reference Points
This method is a folklore projection approach, where the i-th coordinate of point
x in the projected space is computed simply as d(x, π_i), where π_i is a pivot in
the original space, i.e., a randomly selected reference point. Pivots are selected
once, at indexing time.
5 If the dimensionality of the projection space is larger than the dimensionality of the
original space, only the first projDim vectors are orthonormalized. The remaining
ones are simply divided by their norms.
A.4 Permutation-based Projections.
In this approach, we also select projDim pivots at index time. However, instead
of using raw distances to the pivots, we rely on ordinal positions of pivots sorted
by their distance to a point. A more detailed description is given in §4.4.
